Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

11.0 stuck on high network load

54 views
Skip to first unread message

Slawa Olhovchenkov

unread,
Sep 4, 2016, 5:58:04 PM9/4/16
to freebsd...@freebsd.org
I am try using 11.0 on Dual E5-2620 (no X2APIC).
Under high network load and may be addtional conditional system go to
unresponsible state -- no reaction to network and console (USB IPMI
emulation). INVARIANTS give to high overhad. Is this exist some way to
debug this?
_______________________________________________
freebsd...@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stabl...@freebsd.org"

hiren panchasara

unread,
Sep 4, 2016, 9:46:43 PM9/4/16
to Slawa Olhovchenkov, freebsd...@freebsd.org
On 09/05/16 at 12:57P, Slawa Olhovchenkov wrote:
> I am try using 11.0 on Dual E5-2620 (no X2APIC).
> Under high network load and may be addtional conditional system go to
> unresponsible state -- no reaction to network and console (USB IPMI
> emulation). INVARIANTS give to high overhad. Is this exist some way to
> debug this?

Can you panic it from console to get to db> to get backtrace and other
info when it goes unresponsive?

Cheers,
Hiren

Slawa Olhovchenkov

unread,
Sep 5, 2016, 3:44:19 AM9/5/16
to hiren panchasara, freebsd...@freebsd.org
no
no reaction

> Cheers,
> Hiren

Warner Losh

unread,
Sep 5, 2016, 12:15:30 PM9/5/16
to Slawa Olhovchenkov, FreeBSD-STABLE Mailing List, hiren panchasara
On Mon, Sep 5, 2016 at 1:43 AM, Slawa Olhovchenkov <s...@zxy.spb.ru> wrote:
> On Sun, Sep 04, 2016 at 06:46:12PM -0700, hiren panchasara wrote:
>
>> On 09/05/16 at 12:57P, Slawa Olhovchenkov wrote:
>> > I am try using 11.0 on Dual E5-2620 (no X2APIC).
>> > Under high network load and may be addtional conditional system go to
>> > unresponsible state -- no reaction to network and console (USB IPMI
>> > emulation). INVARIANTS give to high overhad. Is this exist some way to
>> > debug this?
>>
>> Can you panic it from console to get to db> to get backtrace and other
>> info when it goes unresponsive?
>
> no
> no reaction

So the canonical 'ipmitool chassis power diag' doesn't send an NMI to
get you to the debugger?

I've seen this at Netflix on one variant of our flash offload box with
a Intel e5-2697v2 running with the Chelsio driver. We're working
around it by having fewer receive threads than CPUs in the system. The
only way the boxes would come back was with watchdog. The load was
streaming video > ~36Gbps out 4 lagged 10G ports. Console is totally
unresponsive as well. This is on our FreeBSD-10 stable based fork.
From my debugging, we go from totally fine as far as I can tell from
ps, etc in the moments leading to the hang to being totally wedged. It
seems a very sudden-onset condition. Sound at all familiar?

Warner

Slawa Olhovchenkov

unread,
Sep 5, 2016, 12:47:28 PM9/5/16
to Warner Losh, FreeBSD-STABLE Mailing List, hiren panchasara
On Mon, Sep 05, 2016 at 10:14:59AM -0600, Warner Losh wrote:

> On Mon, Sep 5, 2016 at 1:43 AM, Slawa Olhovchenkov <s...@zxy.spb.ru> wrote:
> > On Sun, Sep 04, 2016 at 06:46:12PM -0700, hiren panchasara wrote:
> >
> >> On 09/05/16 at 12:57P, Slawa Olhovchenkov wrote:
> >> > I am try using 11.0 on Dual E5-2620 (no X2APIC).
> >> > Under high network load and may be addtional conditional system go to
> >> > unresponsible state -- no reaction to network and console (USB IPMI
> >> > emulation). INVARIANTS give to high overhad. Is this exist some way to
> >> > debug this?
> >>
> >> Can you panic it from console to get to db> to get backtrace and other
> >> info when it goes unresponsive?
> >
> > no
> > no reaction
>
> So the canonical 'ipmitool chassis power diag' doesn't send an NMI to
> get you to the debugger?

Don't try (and don't know about this).
Can you some explain?
Is this FreeBSD by default catch NMI and enter to debugger?
How to interoperable with USB stack (I am beware USB keyboard may be locked)?

> I've seen this at Netflix on one variant of our flash offload box with
> a Intel e5-2697v2 running with the Chelsio driver. We're working
> around it by having fewer receive threads than CPUs in the system. The
> only way the boxes would come back was with watchdog. The load was
> streaming video > ~36Gbps out 4 lagged 10G ports. Console is totally
> unresponsive as well. This is on our FreeBSD-10 stable based fork.
> >From my debugging, we go from totally fine as far as I can tell from
> ps, etc in the moments leading to the hang to being totally wedged. It
> seems a very sudden-onset condition. Sound at all familiar?
>
> Warner

Not sure.
This is less power box and can be servered only 20Gbit, using Intel
card (lagg 2x10H). Day ago I am using on this box 10-STABLE w/o such
issuse. (Not cleancly remember, may be some month ago this box crashed
by this issuse -- at the that time I am don't have any ideas about crash)

May be stuck caused by some poor (too big) memory request from nginx
(atempt parsing some malformed files). Or frequent nginx core dump
(from this malformed files).

11.0 on two different more power box servered from 40 to 55Gbit w/o stuck.
But w/o malformed files (t.e. w/o bogus memory request and w/o nginx
crash). Not sure about correlation.

Warner Losh

unread,
Sep 5, 2016, 1:50:59 PM9/5/16
to Slawa Olhovchenkov, FreeBSD-STABLE Mailing List, hiren panchasara
On Mon, Sep 5, 2016 at 10:46 AM, Slawa Olhovchenkov <s...@zxy.spb.ru> wrote:
> On Mon, Sep 05, 2016 at 10:14:59AM -0600, Warner Losh wrote:
>
>> On Mon, Sep 5, 2016 at 1:43 AM, Slawa Olhovchenkov <s...@zxy.spb.ru> wrote:
>> > On Sun, Sep 04, 2016 at 06:46:12PM -0700, hiren panchasara wrote:
>> >
>> >> On 09/05/16 at 12:57P, Slawa Olhovchenkov wrote:
>> >> > I am try using 11.0 on Dual E5-2620 (no X2APIC).
>> >> > Under high network load and may be addtional conditional system go to
>> >> > unresponsible state -- no reaction to network and console (USB IPMI
>> >> > emulation). INVARIANTS give to high overhad. Is this exist some way to
>> >> > debug this?
>> >>
>> >> Can you panic it from console to get to db> to get backtrace and other
>> >> info when it goes unresponsive?
>> >
>> > no
>> > no reaction
>>
>> So the canonical 'ipmitool chassis power diag' doesn't send an NMI to
>> get you to the debugger?
>
> Don't try (and don't know about this).
> Can you some explain?

The BCM sends the NMI to the CPU.

> Is this FreeBSD by default catch NMI and enter to debugger?

Yes.

> How to interoperable with USB stack (I am beware USB keyboard may be locked)?

I've just done serial console, so I'm not sure. I think that it works...

>> I've seen this at Netflix on one variant of our flash offload box with
>> a Intel e5-2697v2 running with the Chelsio driver. We're working
>> around it by having fewer receive threads than CPUs in the system. The
>> only way the boxes would come back was with watchdog. The load was
>> streaming video > ~36Gbps out 4 lagged 10G ports. Console is totally
>> unresponsive as well. This is on our FreeBSD-10 stable based fork.
>> >From my debugging, we go from totally fine as far as I can tell from
>> ps, etc in the moments leading to the hang to being totally wedged. It
>> seems a very sudden-onset condition. Sound at all familiar?
>>
>> Warner
>
> Not sure.
> This is less power box and can be servered only 20Gbit, using Intel
> card (lagg 2x10H). Day ago I am using on this box 10-STABLE w/o such
> issuse. (Not cleancly remember, may be some month ago this box crashed
> by this issuse -- at the that time I am don't have any ideas about crash)

OK.

> May be stuck caused by some poor (too big) memory request from nginx
> (atempt parsing some malformed files). Or frequent nginx core dump
> (from this malformed files).

OK. We're using nginx too, with our modified sendfile.

> 11.0 on two different more power box servered from 40 to 55Gbit w/o stuck.
> But w/o malformed files (t.e. w/o bogus memory request and w/o nginx
> crash). Not sure about correlation.

In our case it seems like a timing issue between too many threads. The
same hardware can handle 1x40G no probem...

Warner

Slawa Olhovchenkov

unread,
Sep 5, 2016, 2:03:05 PM9/5/16
to Warner Losh, FreeBSD-STABLE Mailing List, hiren panchasara
On Mon, Sep 05, 2016 at 11:50:28AM -0600, Warner Losh wrote:

> > How to interoperable with USB stack (I am beware USB keyboard may be locked)?
>
> I've just done serial console, so I'm not sure. I think that it works...

IPMI or hardware?

> >> I've seen this at Netflix on one variant of our flash offload box with
> >> a Intel e5-2697v2 running with the Chelsio driver. We're working
> >> around it by having fewer receive threads than CPUs in the system. The
> >> only way the boxes would come back was with watchdog. The load was
> >> streaming video > ~36Gbps out 4 lagged 10G ports. Console is totally
> >> unresponsive as well. This is on our FreeBSD-10 stable based fork.
> >> >From my debugging, we go from totally fine as far as I can tell from
> >> ps, etc in the moments leading to the hang to being totally wedged. It
> >> seems a very sudden-onset condition. Sound at all familiar?
> >>
> >> Warner
> >
> > Not sure.
> > This is less power box and can be servered only 20Gbit, using Intel
> > card (lagg 2x10H). Day ago I am using on this box 10-STABLE w/o such
> > issuse. (Not cleancly remember, may be some month ago this box crashed
> > by this issuse -- at the that time I am don't have any ideas about crash)
>
> OK.
>
> > May be stuck caused by some poor (too big) memory request from nginx
> > (atempt parsing some malformed files). Or frequent nginx core dump
> > (from this malformed files).
>
> OK. We're using nginx too, with our modified sendfile.

I am don't use sendfile and use ZFS.

> > 11.0 on two different more power box servered from 40 to 55Gbit w/o stuck.
> > But w/o malformed files (t.e. w/o bogus memory request and w/o nginx
> > crash). Not sure about correlation.
>
> In our case it seems like a timing issue between too many threads. The
> same hardware can handle 1x40G no probem...

I am already reconfigure NIC to have total number of recive thread as
half of total CPU cores. I am don't see high number of AIO tasks
immediate before stuck (around 100).

Slawa Olhovchenkov

unread,
Sep 6, 2016, 5:04:24 PM9/6/16
to Warner Losh, FreeBSD-STABLE Mailing List, hiren panchasara
On Mon, Sep 05, 2016 at 10:14:59AM -0600, Warner Losh wrote:

> On Mon, Sep 5, 2016 at 1:43 AM, Slawa Olhovchenkov <s...@zxy.spb.ru> wrote:
> > On Sun, Sep 04, 2016 at 06:46:12PM -0700, hiren panchasara wrote:
> >
> >> On 09/05/16 at 12:57P, Slawa Olhovchenkov wrote:
> >> > I am try using 11.0 on Dual E5-2620 (no X2APIC).
> >> > Under high network load and may be addtional conditional system go to
> >> > unresponsible state -- no reaction to network and console (USB IPMI
> >> > emulation). INVARIANTS give to high overhad. Is this exist some way to
> >> > debug this?
> >>
> >> Can you panic it from console to get to db> to get backtrace and other
> >> info when it goes unresponsive?
> >
> > no
> > no reaction
>
> So the canonical 'ipmitool chassis power diag' doesn't send an NMI to
> get you to the debugger?

This supermicro MB don't interact with ipmitool over lan :(
either chassis power diag and sol

Slawa Olhovchenkov

unread,
Sep 14, 2016, 5:35:31 PM9/14/16
to hiren panchasara, freebsd...@freebsd.org
On Sun, Sep 04, 2016 at 06:46:12PM -0700, hiren panchasara wrote:

ipmi console don't respond (chassis power diag don't react)
login on sol console stuck on *tcp.

hiren panchasara

unread,
Sep 14, 2016, 5:43:30 PM9/14/16
to Slawa Olhovchenkov, freebsd...@freebsd.org
On 09/15/16 at 12:35P, Slawa Olhovchenkov wrote:
> On Sun, Sep 04, 2016 at 06:46:12PM -0700, hiren panchasara wrote:
>
> > On 09/05/16 at 12:57P, Slawa Olhovchenkov wrote:
> > > I am try using 11.0 on Dual E5-2620 (no X2APIC).
> > > Under high network load and may be addtional conditional system go to
> > > unresponsible state -- no reaction to network and console (USB IPMI
> > > emulation). INVARIANTS give to high overhad. Is this exist some way to
> > > debug this?
> >
> > Can you panic it from console to get to db> to get backtrace and other
> > info when it goes unresponsive?
>
> ipmi console don't respond (chassis power diag don't react)
> login on sol console stuck on *tcp.

I assume you tried ~^b (tilda followed by ctrl+b) without success?

That usually drops into db>

I am also fighting an issue where upon said keystrokes, I see
"KDB: enter: Break to debugger" but it doesn't drop to db>
At that point I have to 'ipmitool blah power reset' the box.

Cheers,
Hiren

hiren panchasara

unread,
Sep 14, 2016, 5:57:22 PM9/14/16
to Slawa Olhovchenkov, freebsd...@freebsd.org
On 09/15/16 at 12:35P, Slawa Olhovchenkov wrote:
> On Sun, Sep 04, 2016 at 06:46:12PM -0700, hiren panchasara wrote:
>
> > On 09/05/16 at 12:57P, Slawa Olhovchenkov wrote:
> > > I am try using 11.0 on Dual E5-2620 (no X2APIC).
> > > Under high network load and may be addtional conditional system go to
> > > unresponsible state -- no reaction to network and console (USB IPMI
> > > emulation). INVARIANTS give to high overhad. Is this exist some way to
> > > debug this?
> >
> > Can you panic it from console to get to db> to get backtrace and other
> > info when it goes unresponsive?
>
> ipmi console don't respond (chassis power diag don't react)
> login on sol console stuck on *tcp.

Also *tcp means its stuck on lock tcp? if so, that'd be lock on
V_tcbinfo. I think?

tcp_subr.c has tcp_init() which calls
in_pcbinfo_init(&V_tcbinfo, "tcp", &V_tcb, hashsize, hashsize,
"tcp_inpcb", tcp_inpcb_init, NULL, 0, IPI_HASHFIELDS_4TUPLE);

and "tcp" is the name used to initialise the lock inside
in_pcbinfo_init() with
INP_INFO_LOCK_INIT(pcbinfo, name);

What exact svn rev are you on? Also do you have any local changes?

Cheers,
Hiren

Slawa Olhovchenkov

unread,
Sep 14, 2016, 5:57:39 PM9/14/16
to hiren panchasara, freebsd...@freebsd.org
On Wed, Sep 14, 2016 at 02:43:06PM -0700, hiren panchasara wrote:

> On 09/15/16 at 12:35P, Slawa Olhovchenkov wrote:
> > On Sun, Sep 04, 2016 at 06:46:12PM -0700, hiren panchasara wrote:
> >
> > > On 09/05/16 at 12:57P, Slawa Olhovchenkov wrote:
> > > > I am try using 11.0 on Dual E5-2620 (no X2APIC).
> > > > Under high network load and may be addtional conditional system go to
> > > > unresponsible state -- no reaction to network and console (USB IPMI
> > > > emulation). INVARIANTS give to high overhad. Is this exist some way to
> > > > debug this?
> > >
> > > Can you panic it from console to get to db> to get backtrace and other
> > > info when it goes unresponsive?
> >
> > ipmi console don't respond (chassis power diag don't react)
> > login on sol console stuck on *tcp.
>
> I assume you tried ~^b (tilda followed by ctrl+b) without success?

~B, as in man ipmitool

> That usually drops into db>

May be now need some sysctl for enable this?

> I am also fighting an issue where upon said keystrokes, I see
> "KDB: enter: Break to debugger" but it doesn't drop to db>
> At that point I have to 'ipmitool blah power reset' the box.
>
> Cheers,
> Hiren


Slawa Olhovchenkov

unread,
Sep 14, 2016, 6:00:32 PM9/14/16
to hiren panchasara, freebsd...@freebsd.org
r305117

> Also do you have any local changes?

only kerberos related.
I am using kerberos and NIS.

hiren panchasara

unread,
Sep 14, 2016, 6:04:43 PM9/14/16
to Slawa Olhovchenkov, freebsd...@freebsd.org
On 09/15/16 at 12:57P, Slawa Olhovchenkov wrote:
> On Wed, Sep 14, 2016 at 02:43:06PM -0700, hiren panchasara wrote:
>
> > On 09/15/16 at 12:35P, Slawa Olhovchenkov wrote:
> > > On Sun, Sep 04, 2016 at 06:46:12PM -0700, hiren panchasara wrote:
> > >
> > > > On 09/05/16 at 12:57P, Slawa Olhovchenkov wrote:
> > > > > I am try using 11.0 on Dual E5-2620 (no X2APIC).
> > > > > Under high network load and may be addtional conditional system go to
> > > > > unresponsible state -- no reaction to network and console (USB IPMI
> > > > > emulation). INVARIANTS give to high overhad. Is this exist some way to
> > > > > debug this?
> > > >
> > > > Can you panic it from console to get to db> to get backtrace and other
> > > > info when it goes unresponsive?
> > >
> > > ipmi console don't respond (chassis power diag don't react)
> > > login on sol console stuck on *tcp.
> >
> > I assume you tried ~^b (tilda followed by ctrl+b) without success?
>
> ~B, as in man ipmitool

No, not shift-b but ctrl-b.

I am not aware of ipmitool reference. On unresponsive console, try
~^b (tilda followed by ctrl+b)

>
> > That usually drops into db>
>
> May be now need some sysctl for enable this?

There is a sysctl for this too but on console, the keystrokes I said
should work, imo.

Cheers,
Hiren

Slawa Olhovchenkov

unread,
Sep 14, 2016, 6:14:02 PM9/14/16
to hiren panchasara, freebsd...@freebsd.org
On Wed, Sep 14, 2016 at 03:04:20PM -0700, hiren panchasara wrote:

> On 09/15/16 at 12:57P, Slawa Olhovchenkov wrote:
> > On Wed, Sep 14, 2016 at 02:43:06PM -0700, hiren panchasara wrote:
> >
> > > On 09/15/16 at 12:35P, Slawa Olhovchenkov wrote:
> > > > On Sun, Sep 04, 2016 at 06:46:12PM -0700, hiren panchasara wrote:
> > > >
> > > > > On 09/05/16 at 12:57P, Slawa Olhovchenkov wrote:
> > > > > > I am try using 11.0 on Dual E5-2620 (no X2APIC).
> > > > > > Under high network load and may be addtional conditional system go to
> > > > > > unresponsible state -- no reaction to network and console (USB IPMI
> > > > > > emulation). INVARIANTS give to high overhad. Is this exist some way to
> > > > > > debug this?
> > > > >
> > > > > Can you panic it from console to get to db> to get backtrace and other
> > > > > info when it goes unresponsive?
> > > >
> > > > ipmi console don't respond (chassis power diag don't react)
> > > > login on sol console stuck on *tcp.
> > >
> > > I assume you tried ~^b (tilda followed by ctrl+b) without success?
> >
> > ~B, as in man ipmitool
>
> No, not shift-b but ctrl-b.
>
> I am not aware of ipmitool reference. On unresponsive console, try
> ~^b (tilda followed by ctrl+b)

ipmitool ~B send break, do you talk about break or about special
serial console keystroke to enter debuger, distinct from break?

> >
> > > That usually drops into db>
> >
> > May be now need some sysctl for enable this?
>
> There is a sysctl for this too but on console, the keystrokes I said
> should work, imo.
>
> Cheers,
> Hiren


Oliver Pinter

unread,
Sep 14, 2016, 6:17:49 PM9/14/16
to Warner Losh, FreeBSD-STABLE Mailing List, hiren panchasara, Slawa Olhovchenkov
On Monday, September 5, 2016, Warner Losh <i...@bsdimp.com
<javascript:_e(%7B%7D,'cvml','i...@bsdimp.com');>> wrote:

> On Mon, Sep 5, 2016 at 1:43 AM, Slawa Olhovchenkov <s...@zxy.spb.ru> wrote:
> > On Sun, Sep 04, 2016 at 06:46:12PM -0700, hiren panchasara wrote:
> >
> >> On 09/05/16 at 12:57P, Slawa Olhovchenkov wrote:
> >> > I am try using 11.0 on Dual E5-2620 (no X2APIC).
> >> > Under high network load and may be addtional conditional system go to
> >> > unresponsible state -- no reaction to network and console (USB IPMI
> >> > emulation). INVARIANTS give to high overhad. Is this exist some way to
> >> > debug this?
> >>
> >> Can you panic it from console to get to db> to get backtrace and other
> >> info when it goes unresponsive?
> >
> > no
> > no reaction
>
> So the canonical 'ipmitool chassis power diag' doesn't send an NMI to
> get you to the debugger?
>
> I've seen this at Netflix on one variant of our flash offload box with
> a Intel e5-2697v2 running with the Chelsio driver. We're working
> around it by having fewer receive threads than CPUs in the system. The
> only way the boxes would come back was with watchdog. The load was
> streaming video > ~36Gbps out 4 lagged 10G ports. Console is totally
> unresponsive as well.


> Try to set kern.sched.preempt_thresh sysctl to 224 to get back your
> console..

Gary Palmer

unread,
Sep 14, 2016, 6:23:46 PM9/14/16
to Slawa Olhovchenkov, freebsd...@freebsd.org, hiren panchasara
On Thu, Sep 15, 2016 at 01:13:35AM +0300, Slawa Olhovchenkov wrote:
> On Wed, Sep 14, 2016 at 03:04:20PM -0700, hiren panchasara wrote:
>
> > On 09/15/16 at 12:57P, Slawa Olhovchenkov wrote:
> > > On Wed, Sep 14, 2016 at 02:43:06PM -0700, hiren panchasara wrote:
> > >
> > > > On 09/15/16 at 12:35P, Slawa Olhovchenkov wrote:
> > > > > On Sun, Sep 04, 2016 at 06:46:12PM -0700, hiren panchasara wrote:
> > > > >
> > > > > > On 09/05/16 at 12:57P, Slawa Olhovchenkov wrote:
> > > > > > > I am try using 11.0 on Dual E5-2620 (no X2APIC).
> > > > > > > Under high network load and may be addtional conditional system go to
> > > > > > > unresponsible state -- no reaction to network and console (USB IPMI
> > > > > > > emulation). INVARIANTS give to high overhad. Is this exist some way to
> > > > > > > debug this?
> > > > > >
> > > > > > Can you panic it from console to get to db> to get backtrace and other
> > > > > > info when it goes unresponsive?
> > > > >
> > > > > ipmi console don't respond (chassis power diag don't react)
> > > > > login on sol console stuck on *tcp.
> > > >
> > > > I assume you tried ~^b (tilda followed by ctrl+b) without success?
> > >
> > > ~B, as in man ipmitool
> >
> > No, not shift-b but ctrl-b.
> >
> > I am not aware of ipmitool reference. On unresponsive console, try
> > ~^b (tilda followed by ctrl+b)
>
> ipmitool ~B send break, do you talk about break or about special
> serial console keystroke to enter debuger, distinct from break?

A lot of old console servers used to send break when they reset or
on boot, so the FreeBSD console does not break to debugger when
a break command is received. That's why the tidla CTRL-b is
required as it's sufficiently unlikely to be accidentally sent by
equipment connected to the serial console.

Regards,

Gary

Slawa Olhovchenkov

unread,
Sep 14, 2016, 6:25:59 PM9/14/16
to Gary Palmer, freebsd...@freebsd.org, hiren panchasara
OK, thanks to clarification, I am try ~^B on next attempt.

Oleksandr V. Typlyns'kyi

unread,
Sep 14, 2016, 7:33:42 PM9/14/16
to Slawa Olhovchenkov, freebsd...@freebsd.org
Sep 5 Sep 5, 2016 at 00:57 Slawa Olhovchenkov wrote:

> I am try using 11.0 on Dual E5-2620 (no X2APIC).
> Under high network load and may be addtional conditional system go to
> unresponsible state -- no reaction to network and console (USB IPMI
> emulation). INVARIANTS give to high overhad. Is this exist some way to
> debug this?

Do you have any routing table changes?
In my case system totally hangs after route change command invocation.
Sequence of commands route del + route add works fine.
Maybe it is FIB related, but with -fib option system hangs too.

--
WNGS-RIPE

Slawa Olhovchenkov

unread,
Sep 14, 2016, 7:51:37 PM9/14/16
to Oleksandr V. Typlyns'kyi, freebsd...@freebsd.org
On Thu, Sep 15, 2016 at 02:33:07AM +0300, Oleksandr V. Typlyns'kyi wrote:

> Sep 5 Sep 5, 2016 at 00:57 Slawa Olhovchenkov wrote:
>
> > I am try using 11.0 on Dual E5-2620 (no X2APIC).
> > Under high network load and may be addtional conditional system go to
> > unresponsible state -- no reaction to network and console (USB IPMI
> > emulation). INVARIANTS give to high overhad. Is this exist some way to
> > debug this?
>
> Do you have any routing table changes?
> In my case system totally hangs after route change command invocation.
> Sequence of commands route del + route add works fine.
> Maybe it is FIB related, but with -fib option system hangs too.

No, only default exist and don't changed

Konstantin Belousov

unread,
Sep 15, 2016, 5:00:12 AM9/15/16
to Slawa Olhovchenkov, freebsd...@freebsd.org
On Thu, Sep 15, 2016 at 12:35:04AM +0300, Slawa Olhovchenkov wrote:
> On Sun, Sep 04, 2016 at 06:46:12PM -0700, hiren panchasara wrote:
>
> > On 09/05/16 at 12:57P, Slawa Olhovchenkov wrote:
> > > I am try using 11.0 on Dual E5-2620 (no X2APIC).
> > > Under high network load and may be addtional conditional system go to
> > > unresponsible state -- no reaction to network and console (USB IPMI
> > > emulation). INVARIANTS give to high overhad. Is this exist some way to
> > > debug this?
> >
> > Can you panic it from console to get to db> to get backtrace and other
> > info when it goes unresponsive?
>
> ipmi console don't respond (chassis power diag don't react)
> login on sol console stuck on *tcp.

Is 'login' you reference is the ipmi client state, or you mean login(1)
on the wedged host ?

If BMC stops responding simultaneously with the host, I would suspect
the hardware platform issues instead of a software problem. Do you have
dedicated LAN port for BMC ?

Slawa Olhovchenkov

unread,
Sep 15, 2016, 5:06:59 AM9/15/16
to Konstantin Belousov, freebsd...@freebsd.org
On Thu, Sep 15, 2016 at 11:59:38AM +0300, Konstantin Belousov wrote:

> On Thu, Sep 15, 2016 at 12:35:04AM +0300, Slawa Olhovchenkov wrote:
> > On Sun, Sep 04, 2016 at 06:46:12PM -0700, hiren panchasara wrote:
> >
> > > On 09/05/16 at 12:57P, Slawa Olhovchenkov wrote:
> > > > I am try using 11.0 on Dual E5-2620 (no X2APIC).
> > > > Under high network load and may be addtional conditional system go to
> > > > unresponsible state -- no reaction to network and console (USB IPMI
> > > > emulation). INVARIANTS give to high overhad. Is this exist some way to
> > > > debug this?
> > >
> > > Can you panic it from console to get to db> to get backtrace and other
> > > info when it goes unresponsive?
> >
> > ipmi console don't respond (chassis power diag don't react)
> > login on sol console stuck on *tcp.
>
> Is 'login' you reference is the ipmi client state, or you mean login(1)
> on the wedged host ?

on the wedged host

> If BMC stops responding simultaneously with the host, I would suspect
> the hardware platform issues instead of a software problem. Do you have
> dedicated LAN port for BMC ?

Yes.
But BMC emulate USB keyboard and this is may be lock inside USB
system.
"ipmi console don't respond" must be read as "ipmi console runnnig and
attached but system don't react to keypress on this console".
at the sime moment system respon to `enter` on ipmi sol console, but
after enter `root` stuck in login in the '*tcp' state (I think this is
NIS related).

Slawa Olhovchenkov

unread,
Sep 16, 2016, 2:19:10 PM9/16/16
to Konstantin Belousov, freebsd...@freebsd.org
~^B don't break to debuger.
But I can login to sol console.
All system work tooooo slooooow.
Some cores displayed by top as 100% idle.
Some cores displayed by top as 100% interrupt.

last pid: 16631; load averages: 9.00, 9.08, 9.35 up 0+06:26:34 19:53:13
832 processes: 22 running, 765 sleeping, 42 waiting, 3 lock
CPU 0: 0.0% user, 0.0% nice, 0.0% system, 100% interrupt, 0.0% idle
CPU 1: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU 2: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU 3: 0.0% user, 0.0% nice, 0.4% system, 0.0% interrupt, 99.6% idle
CPU 4: 0.0% user, 0.0% nice, 0.0% system, 100% interrupt, 0.0% idle
CPU 5: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU 6: 0.0% user, 0.0% nice, 0.0% system, 100% interrupt, 0.0% idle
CPU 7: 0.0% user, 0.0% nice, 0.0% system, 100% interrupt, 0.0% idle
CPU 8: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU 9: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU 10: 0.0% user, 0.0% nice, 0.0% system, 100% interrupt, 0.0% idle
CPU 11: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
Mem: 1564M Active, 5169M Inact, 114G Wired, 4548M Free
ARC: 107G Total, 98G MFU, 9476M MRU, 160K Anon, 150M Header, 254M Other
Swap: 32G Total, 32G Free

Now I am trying to

# pmcstat -c 0 -S CPU_CLK_UNHALTED_CORE -l 10 -O sample0.out

load: 10.32 cmd: pmcstat 16878 [runnable] 4632.42r 0.00u 0.00s 0% 2940k

I am still waiting

Eugene Grosbein

unread,
Sep 16, 2016, 2:23:44 PM9/16/16
to Slawa Olhovchenkov, freebsd...@freebsd.org
17.09.2016 1:18, Slawa Olhovchenkov пишет:

> ~^B don't break to debuger.

Make sure your kernel config has:

# Solaris implements a new BREAK which is initiated by a character
# sequence CR ~ ^b which is similar to a familiar pattern used on
# Sun servers by the Remote Console. There are FreeBSD extensions:
# CR ~ ^p requests force panic and CR ~ ^r requests a clean reboot.
options ALT_BREAK_TO_DEBUGGER

And don't forget that tilde (~) should follow CR (Enter).

hiren panchasara

unread,
Sep 16, 2016, 2:31:22 PM9/16/16
to Slawa Olhovchenkov, Konstantin Belousov, freebsd...@freebsd.org
You can probably:
debug.kdb.enter: set to enter the debugger

or force a panic and get vmcore:
debug.kdb.panic: set to panic the kernel

Cheers,
Hiren

Slawa Olhovchenkov

unread,
Sep 16, 2016, 2:42:57 PM9/16/16
to hiren panchasara, Konstantin Belousov, freebsd...@freebsd.org
I am still waiting to exit pmcstat.
Oh, for NMI need not `ipmitool chassis power diag`!
need `ipmitool power diag`!
But debugger not entered:

^C^C^C^C^C^C^C^CNMI ... going to debugger
NMI ... going to debugger
NMI ... going to debugger
NMI ... going to debugger
NMI ... going to debugger
NMI ... going to debugger
NMI ... going to debugger
NMI ... going to debugger
NMI ... going to debugger
NMI ... going to debugger
NMI ... going to debugger
NMI ... going to debugger

load: 9.91 cmd: pmcstat 16878 [runnable] 5930.57r 0.00u 0.00s 0% 2940k

Slawa Olhovchenkov

unread,
Sep 16, 2016, 3:03:58 PM9/16/16
to hiren panchasara, Konstantin Belousov, freebsd...@freebsd.org
I am reset this host.
PMC samples collected and decoded:

@ CPU_CLK_UNHALTED_CORE [4653445 samples]

51.86% [2413083] lock_delay @ /boot/kernel.VSTREAM/kernel
100.0% [2413083] __rw_wlock_hard
100.0% [2413083] tcp_tw_2msl_scan
99.99% [2412958] pfslowtimo
100.0% [2412958] softclock_call_cc
100.0% [2412958] softclock
100.0% [2412958] intr_event_execute_handlers
100.0% [2412958] ithread_loop
100.0% [2412958] fork_exit
00.01% [125] tcp_twstart
100.0% [125] tcp_do_segment
100.0% [125] tcp_input
100.0% [125] ip_input
100.0% [125] swi_net
100.0% [125] intr_event_execute_handlers
100.0% [125] ithread_loop
100.0% [125] fork_exit

09.43% [438774] _rw_runlock_cookie @ /boot/kernel.VSTREAM/kernel
100.0% [438774] tcp_tw_2msl_scan
99.99% [438735] pfslowtimo
100.0% [438735] softclock_call_cc
100.0% [438735] softclock
100.0% [438735] intr_event_execute_handlers
100.0% [438735] ithread_loop
100.0% [438735] fork_exit
00.01% [39] tcp_twstart
100.0% [39] tcp_do_segment
100.0% [39] tcp_input
100.0% [39] ip_input
100.0% [39] swi_net
100.0% [39] intr_event_execute_handlers
100.0% [39] ithread_loop
100.0% [39] fork_exit

08.57% [398970] __rw_wlock_hard @ /boot/kernel.VSTREAM/kernel
100.0% [398970] tcp_tw_2msl_scan
99.99% [398940] pfslowtimo
100.0% [398940] softclock_call_cc
100.0% [398940] softclock
100.0% [398940] intr_event_execute_handlers
100.0% [398940] ithread_loop
100.0% [398940] fork_exit
00.01% [30] tcp_twstart
100.0% [30] tcp_do_segment
100.0% [30] tcp_input
100.0% [30] ip_input
100.0% [30] swi_net
100.0% [30] intr_event_execute_handlers
100.0% [30] ithread_loop
100.0% [30] fork_exit

05.79% [269224] __rw_try_rlock @ /boot/kernel.VSTREAM/kernel
100.0% [269224] tcp_tw_2msl_scan
99.99% [269203] pfslowtimo
100.0% [269203] softclock_call_cc
100.0% [269203] softclock
100.0% [269203] intr_event_execute_handlers
100.0% [269203] ithread_loop
100.0% [269203] fork_exit
00.01% [21] tcp_twstart
100.0% [21] tcp_do_segment
100.0% [21] tcp_input
100.0% [21] ip_input
100.0% [21] swi_net
100.0% [21] intr_event_execute_handlers
100.0% [21] ithread_loop
100.0% [21] fork_exit

05.35% [249141] _rw_wlock_cookie @ /boot/kernel.VSTREAM/kernel
99.76% [248543] tcp_tw_2msl_scan
99.99% [248528] pfslowtimo
100.0% [248528] softclock_call_cc
100.0% [248528] softclock
100.0% [248528] intr_event_execute_handlers
100.0% [248528] ithread_loop
100.0% [248528] fork_exit
00.01% [15] tcp_twstart
100.0% [15] tcp_do_segment
100.0% [15] tcp_input
100.0% [15] ip_input
100.0% [15] swi_net
100.0% [15] intr_event_execute_handlers
100.0% [15] ithread_loop
100.0% [15] fork_exit
00.24% [598] pfslowtimo
100.0% [598] softclock_call_cc
100.0% [598] softclock
100.0% [598] intr_event_execute_handlers
100.0% [598] ithread_loop
100.0% [598] fork_exit

hiren panchasara

unread,
Sep 16, 2016, 3:12:24 PM9/16/16
to Slawa Olhovchenkov, j...@freebsd.org, Konstantin Belousov, freebsd...@freebsd.org
+ jch@
As I suspected, this looks like a hang trying to lock V_tcbinfo.

I'm ccing Julien here who worked on WLOCK -> RLOCK transition to improve
performance for short-lived connections. I am not too sure if thats the
problem but looks in similar area so he may be able to provide some
insights.

Cheers,
Hiren

Slawa Olhovchenkov

unread,
Sep 16, 2016, 4:31:27 PM9/16/16
to hiren panchasara, Konstantin Belousov, j...@freebsd.org, freebsd...@freebsd.org
On Fri, Sep 16, 2016 at 12:11:55PM -0700, hiren panchasara wrote:

>
> As I suspected, this looks like a hang trying to lock V_tcbinfo.
>
> I'm ccing Julien here who worked on WLOCK -> RLOCK transition to improve
> performance for short-lived connections. I am not too sure if thats the
> problem but looks in similar area so he may be able to provide some
> insights.

No, this is other case. In may case at this time no network traffic
more then hour. This is some sore of deadlock or like.

hiren panchasara

unread,
Sep 16, 2016, 5:47:13 PM9/16/16
to Slawa Olhovchenkov, Konstantin Belousov, j...@freebsd.org, freebsd...@freebsd.org
On 09/16/16 at 11:30P, Slawa Olhovchenkov wrote:
> On Fri, Sep 16, 2016 at 12:11:55PM -0700, hiren panchasara wrote:
>
> >
> > As I suspected, this looks like a hang trying to lock V_tcbinfo.
> >
> > I'm ccing Julien here who worked on WLOCK -> RLOCK transition to improve
> > performance for short-lived connections. I am not too sure if thats the
> > problem but looks in similar area so he may be able to provide some
> > insights.
>
> No, this is other case. In may case at this time no network traffic
> more then hour. This is some sore of deadlock or like.

In my limited understanding, such deadlock like situation can occur
without traffic too.

Cheers,
Hiren

hiren panchasara

unread,
Sep 16, 2016, 5:49:13 PM9/16/16
to Slawa Olhovchenkov, Konstantin Belousov, j...@freebsd.org, freebsd...@freebsd.org
Err, I meant to say light traffic.

Cheers,
Hiren

Slawa Olhovchenkov

unread,
Sep 17, 2016, 12:33:13 PM9/17/16
to hiren panchasara, Konstantin Belousov, j...@freebsd.org, freebsd...@freebsd.org
I am point again -- I am dont have traffic totaly.
Absolytly no traffic

Slawa Olhovchenkov

unread,
Sep 18, 2016, 12:43:31 PM9/18/16
to hiren panchasara, Konstantin Belousov, j...@freebsd.org, freebsd...@freebsd.org
On Fri, Sep 16, 2016 at 12:11:55PM -0700, hiren panchasara wrote:

I am point to tcp_tw_2msl_scan.
I am expect traveling by list V_twq_2msl.
But I am see only endless attempt to lock first element from this
list.
Is this correct?

Hans Petter Selasky

unread,
Sep 18, 2016, 1:45:47 PM9/18/16
to s...@zxy.spb.ru, freebsd...@freebsd.org
Hi,

Got some tips regarding this thread.

Some things you can try:

1) Compile kernel from projects/hps_head instead of your 11-stable?

2) Set net.inet.tcp.per_cpu_timers=1

If the system just hangs, it is pretty likely that the timers are going
in a loop due to typical use after free.

Please keep me CC'ed, hence I'm not subscribed to @stable.

--HPS

Slawa Olhovchenkov

unread,
Sep 18, 2016, 2:10:45 PM9/18/16
to Hans Petter Selasky, freebsd...@freebsd.org
On Sun, Sep 18, 2016 at 07:50:08PM +0200, Hans Petter Selasky wrote:

> Hi,
>
> Got some tips regarding this thread.
>
> Some things you can try:
>
> 1) Compile kernel from projects/hps_head instead of your 11-stable?

How many difference from 11-stable?

> 2) Set net.inet.tcp.per_cpu_timers=1

Already. From 10.x, by manual MFC.

Hans Petter Selasky

unread,
Sep 18, 2016, 4:34:44 PM9/18/16
to Slawa Olhovchenkov, freebsd...@freebsd.org
On 09/18/16 20:10, Slawa Olhovchenkov wrote:
> On Sun, Sep 18, 2016 at 07:50:08PM +0200, Hans Petter Selasky wrote:
>
>> Hi,
>>
>> Got some tips regarding this thread.
>>
>> Some things you can try:
>>
>> 1) Compile kernel from projects/hps_head instead of your 11-stable?
>
> How many difference from 11-stable?

Hi,

The callout subsystem has a different implementation. Else identical.

>
>> 2) Set net.inet.tcp.per_cpu_timers=1
>
> Already. From 10.x, by manual MFC.

OK.

Slawa Olhovchenkov

unread,
Sep 18, 2016, 4:46:45 PM9/18/16
to Hans Petter Selasky, freebsd...@freebsd.org
On Sun, Sep 18, 2016 at 10:38:58PM +0200, Hans Petter Selasky wrote:

> On 09/18/16 20:10, Slawa Olhovchenkov wrote:
> > On Sun, Sep 18, 2016 at 07:50:08PM +0200, Hans Petter Selasky wrote:
> >
> >> Hi,
> >>
> >> Got some tips regarding this thread.
> >>
> >> Some things you can try:
> >>
> >> 1) Compile kernel from projects/hps_head instead of your 11-stable?
> >
> > How many difference from 11-stable?
>
> Hi,
>
> The callout subsystem has a different implementation. Else identical.

userbase compatible?
can i recompile only kernel?

Hans Petter Selasky

unread,
Sep 19, 2016, 3:19:29 AM9/19/16
to Slawa Olhovchenkov, freebsd...@freebsd.org
On 09/18/16 22:46, Slawa Olhovchenkov wrote:
> userbase compatible?
> can i recompile only kernel?

Yes, only the kernel.

--HPS

Julien Charbon

unread,
Sep 19, 2016, 4:32:57 PM9/19/16
to Slawa Olhovchenkov, hiren panchasara, Konstantin Belousov, freebsd...@freebsd.org

Hi Slawa,
The only write lock tcp_tw_2msl_scan() tries to get is a
INP_WLOCK(inp). Thus here, tcp_tw_2msl_scan() seems to be stuck
spinning on INP_WLOCK (or pfslowtimo() is going crazy and calls
tcp_tw_2msl_scan() at high rate but this will be quite unexpected).

Thus my hypothesis is that something is holding the INP_WLOCK and not
releasing it, and tcp_tw_2msl_scan() is spinning on it.

If you can, could you compile the kernel with below options:

options DDB # Support DDB.
options DEADLKRES # Enable the deadlock resolver
options INVARIANTS # Enable calls of extra sanity
checking
options INVARIANT_SUPPORT # Extra sanity checks of internal
structures, required by INVARIANTS
options WITNESS # Enable checks to detect
deadlocks and cycles
options WITNESS_SKIPSPIN # Don't run witness on spinlocks
for speed

And once the issue is reproduce, run in ddb run the below commands:

show pcpu
show allpcpu
show locks
show alllocks
show lockchain
show allchains
show all trace

This is to see if the contention is indeed on the tcp_tw_2msl_scan's
INP_WLOCK.

--
Julien



signature.asc

Slawa Olhovchenkov

unread,
Sep 19, 2016, 4:44:02 PM9/19/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
Currently this host run with 100% CPU load (on all cores), i.e.
enabling WITNESS will be significant drop performance.
Can I use only some subset of options?

Also, I can some troubles to DDB enter in this case.
May be kgdb will be success (not tryed yet)?

> And once the issue is reproduce, run in ddb run the below commands:
>
> show pcpu
> show allpcpu
> show locks
> show alllocks
> show lockchain
> show allchains
> show all trace
>
> This is to see if the contention is indeed on the tcp_tw_2msl_scan's
> INP_WLOCK.

Slawa Olhovchenkov

unread,
Sep 20, 2016, 4:27:02 PM9/20/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
On Tue, Sep 20, 2016 at 10:00:25PM +0200, Julien Charbon wrote:

>
> Hi Slawa,
> If these kernel options will certainly slow down your kernel, they also
> might found the root cause of your issue before reaching the point where
> you have 100% cpu load on all cores (thanks to INVARIANTS). I would
> suggest:

Hmmm, may be I am not clarified.
This host run at peak hours with 100% CPU load as normal operation,
this is for servering 2x10G, this is CPU load not result of lock
issuse, this is not us case. And this is because I am fear to enable
WITNESS -- I am fear drop performance.

This lock issuse happen irregulary and may be caused by other issuse
(nginx crashed). In this case about 1/3 cores have 100% cpu load,
perhaps by this lock -- I am can trace only from one core and need
more then hour for this (may be on other cores different trace, I
can't guaranted anything).

> #1. Try above kernel options at least once, and see what you can get.

OK, I am try this after some time.

> #2. If #1 is a total failure try below patch: It won't solve anything,
> it just makes tcp_tw_2msl_scan() less greedy when there is contention on
> the INP write lock. If it makes the debugging more feasible, continue
> to #3.

OK, thanks.
What purpose to not skip locked tcptw in this loop?

> diff --git a/sys/netinet/tcp_timewait.c b/sys/netinet/tcp_timewait.c
> index a8b78f9..4206ea3 100644
> --- a/sys/netinet/tcp_timewait.c
> +++ b/sys/netinet/tcp_timewait.c
> @@ -701,34 +701,42 @@ tcp_tw_2msl_scan(int reuse)
> in_pcbref(inp);
> TW_RUNLOCK(V_tw_lock);
>
> +retry:
> if (INP_INFO_TRY_RLOCK(&V_tcbinfo)) {
>
> - INP_WLOCK(inp);
> - tw = intotw(inp);
> - if (in_pcbrele_wlocked(inp)) {
> - KASSERT(tw == NULL, ("%s: held last inp "
> - "reference but tw not NULL", __func__));
> - INP_INFO_RUNLOCK(&V_tcbinfo);
> - continue;
> - }
> + if (INP_TRY_WLOCK(inp)) {
> + tw = intotw(inp);
> + if (in_pcbrele_wlocked(inp)) {
> + KASSERT(tw == NULL, ("%s: held
> last inp "
> + "reference but tw not NULL",
> __func__));
> + INP_INFO_RUNLOCK(&V_tcbinfo);
> + continue;
> + }
>
> - if (tw == NULL) {
> - /* tcp_twclose() has already been called */
> - INP_WUNLOCK(inp);
> - INP_INFO_RUNLOCK(&V_tcbinfo);
> - continue;
> - }
> + if (tw == NULL) {
> + /* tcp_twclose() has already
> been called */
> + INP_WUNLOCK(inp);
> + INP_INFO_RUNLOCK(&V_tcbinfo);
> + continue;
> + }
>
> - tcp_twclose(tw, reuse);
> - INP_INFO_RUNLOCK(&V_tcbinfo);
> - if (reuse)
> - return tw;
> + tcp_twclose(tw, reuse);
> + INP_INFO_RUNLOCK(&V_tcbinfo);
> + if (reuse)
> + return tw;
> + } else {
> + INP_INFO_RUNLOCK(&V_tcbinfo);
> + goto retry;
> + }
> } else {
> /* INP_INFO lock is busy, continue later. */
> - INP_WLOCK(inp);
> - if (!in_pcbrele_wlocked(inp))
> - INP_WUNLOCK(inp);
> - break;
> + if (INP_TRY_WLOCK(inp)) {
> + if (!in_pcbrele_wlocked(inp))
> + INP_WUNLOCK(inp);
> + break;
> + } else {
> + goto retry;
> + }
> }
> }
>
> #3. Once the issue is reproduced, launch ddb and run the below commands:
>
> show pcpu
> show allpcpu
> show locks
> show alllocks
> show lockchain
> show allchains
> show all trace
>
> My 2 cents.
>
> --
> Julien

Julien Charbon

unread,
Sep 20, 2016, 4:59:56 PM9/20/16
to Slawa Olhovchenkov, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara

Hi Slawa,

On 9/19/16 10:43 PM, Slawa Olhovchenkov wrote:
If these kernel options will certainly slow down your kernel, they also
might found the root cause of your issue before reaching the point where
you have 100% cpu load on all cores (thanks to INVARIANTS). I would
suggest:

#1. Try above kernel options at least once, and see what you can get.

#2. If #1 is a total failure try below patch: It won't solve anything,
it just makes tcp_tw_2msl_scan() less greedy when there is contention on
the INP write lock. If it makes the debugging more feasible, continue
to #3.

#3. Once the issue is reproduced, launch ddb and run the below commands:

show pcpu
show allpcpu
show locks
show alllocks
show lockchain
show allchains
show all trace

signature.asc

Julien Charbon

unread,
Sep 21, 2016, 3:23:53 AM9/21/16
to Slawa Olhovchenkov, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara

Hi Slawa,
I see, especially if you are running in production WITNESS might indeed
be not practical for you. In this case, I would suggest before doing
WITNESS and still get more information to:

#0: Do a lock profiling:

https://www.freebsd.org/cgi/man.cgi?query=LOCK_PROFILING

options LOCK_PROFILING

Example of usage:

# Run
$ sudo sysctl debug.lock.prof.enable=1
$ sleep 10
$ sudo sysctl debug.lock.prof.enable=0

# Get results
$ sysctl debug.lock.prof.stats | head -2; sysctl debug.lock.prof.stats |
sort -n -k 4 -r

You can also use Dtrace and lockstat (especially with the lockstat -s
option):

https://wiki.freebsd.org/DTrace/One-Liners#Kernel_Locks
https://www.freebsd.org/cgi/man.cgi?query=lockstat&manpath=FreeBSD+11.0-RELEASE

But I am less familiar with Dtrace/lockstat tools.

>> #1. Try above kernel options at least once, and see what you can get.
>
> OK, I am try this after some time.
>
>> #2. If #1 is a total failure try below patch: It won't solve anything,
>> it just makes tcp_tw_2msl_scan() less greedy when there is contention on
>> the INP write lock. If it makes the debugging more feasible, continue
>> to #3.
>
> OK, thanks.
> What purpose to not skip locked tcptw in this loop?

If I understand your question correctly: According to your pmcstat
result, tcp_tw_2msl_scan() currently struggles with a write lock
(__rw_wlock_hard) and the only write lock used tcp_tw_2msl_scan() is
INP_WLOCK. No sign of contention on TW_RLOCK(V_tw_lock) currently.

51.86% [2413083] lock_delay @ /boot/kernel.VSTREAM/kernel
100.0% [2413083] __rw_wlock_hard
100.0% [2413083] tcp_tw_2msl_scan

Slawa Olhovchenkov

unread,
Sep 21, 2016, 4:32:18 AM9/21/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
OK, but in case of leak lock (why inp lock too long for
tcp_tw_2msl_scan?) I can't see cause of this lock running this
commands after stuck happen?

> You can also use Dtrace and lockstat (especially with the lockstat -s
> option):
>
> https://wiki.freebsd.org/DTrace/One-Liners#Kernel_Locks
> https://www.freebsd.org/cgi/man.cgi?query=lockstat&manpath=FreeBSD+11.0-RELEASE
>
> But I am less familiar with Dtrace/lockstat tools.

OK, interesting too. Thanks.

> >> #1. Try above kernel options at least once, and see what you can get.
> >
> > OK, I am try this after some time.
> >
> >> #2. If #1 is a total failure try below patch: It won't solve anything,
> >> it just makes tcp_tw_2msl_scan() less greedy when there is contention on
> >> the INP write lock. If it makes the debugging more feasible, continue
> >> to #3.
> >
> > OK, thanks.
> > What purpose to not skip locked tcptw in this loop?
>
> If I understand your question correctly: According to your pmcstat
> result, tcp_tw_2msl_scan() currently struggles with a write lock
> (__rw_wlock_hard) and the only write lock used tcp_tw_2msl_scan() is
> INP_WLOCK. No sign of contention on TW_RLOCK(V_tw_lock) currently.

As I see in code, tcp_tw_2msl_scan got first node from V_twq_2msl and
need got RW lock on inp w/o alternates. Can tcp_tw_2msl_scan skip current node
and go to next node in V_twq_2msl list if current node locked by some
reasson?

Slawa Olhovchenkov

unread,
Sep 21, 2016, 3:52:28 PM9/21/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
On Wed, Sep 21, 2016 at 09:11:24AM +0200, Julien Charbon wrote:

>
> You can also use Dtrace and lockstat (especially with the lockstat -s
> option):
>
> https://wiki.freebsd.org/DTrace/One-Liners#Kernel_Locks
> https://www.freebsd.org/cgi/man.cgi?query=lockstat&manpath=FreeBSD+11.0-RELEASE
>
> But I am less familiar with Dtrace/lockstat tools.

I am still use old kernel and got lockdown again.
Try using lockstat (I am save more output), interesting may be next:

R/W writer spin on writer: 190019 events in 1.070 seconds (177571 events/sec)

-------------------------------------------------------------------------------
Count indv cuml rcnt nsec Lock Caller
140839 74% 74% 0.00 24659 tcpinp tcp_tw_2msl_scan+0xc6

nsec ------ Time Distribution ------ count Stack
4096 | 913 tcp_twstart+0xa3
8192 |@@@@@@@@@@@@ 58191 tcp_do_segment+0x201f
16384 |@@@@@@ 29594 tcp_input+0xe1c
32768 |@@@@ 23447 ip_input+0x15f
65536 |@@@ 16197
131072 |@ 8674
262144 | 3358
524288 | 456
1048576 | 9
-------------------------------------------------------------------------------
Count indv cuml rcnt nsec Lock Caller
49180 26% 100% 0.00 15929 tcpinp tcp_tw_2msl_scan+0xc6

nsec ------ Time Distribution ------ count Stack
4096 | 157 pfslowtimo+0x54
8192 |@@@@@@@@@@@@@@@ 24796 softclock_call_cc+0x179
16384 |@@@@@@ 11223 softclock+0x44
32768 |@@@@ 7426 intr_event_execute_handlers+0x95
65536 |@@ 3918
131072 | 1363
262144 | 278
524288 | 19
-------------------------------------------------------------------------------

Julien Charbon

unread,
Sep 21, 2016, 5:26:07 PM9/21/16
to Slawa Olhovchenkov, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara

Hi Slawa,
This is interesting, it seems that you have two call paths competing
for INP locks here:

- pfslowtimo()/tcp_tw_2msl_scan(reuse=0) and

- tcp_input()/tcp_twstart()/tcp_tw_2msl_scan(reuse=1)

These paths can indeed compete for the same INP lock, as both
tcp_tw_2msl_scan() calls always start with the first inp found in
twq_2msl list. But in both cases, this first inp should be quickly used
and its lock released anyway, thus that could explain your situation it
that the TCP stack is doing that all the time, for example:

- Let say that you are running out completely and constantly of tcptw,
and then all connections transitioning to TIME_WAIT state are competing
with the TIME_WAIT timeout scan that tries to free all the expired
tcptw. If the stack is doing that all the time, it can appear like
"live" locked.

This is just an hypothesis and as usual might be a red herring.
Anyway, could you run:

$ vmstat -z | head -2; vmstat -z | grep -E 'tcp|sock'

Ideally, once when everything is ok, and once when you have the issue
to see the differences (if any).

If it appears your are quite low in tcptw, and if you have enough
memory, could you try increase the tcptw limit using sysctl
net.inet.tcp.maxtcptw? And actually see if it improve (or not) your
performance.
signature.asc

Julien Charbon

unread,
Sep 22, 2016, 5:34:33 AM9/22/16
to Slawa Olhovchenkov, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara

Hi Slawa,
>>> What purpose to not skip locked tcptw in this loop?
>>
>> If I understand your question correctly: According to your pmcstat
>> result, tcp_tw_2msl_scan() currently struggles with a write lock
>> (__rw_wlock_hard) and the only write lock used tcp_tw_2msl_scan() is
>> INP_WLOCK. No sign of contention on TW_RLOCK(V_tw_lock) currently.
>
> As I see in code, tcp_tw_2msl_scan got first node from V_twq_2msl and
> need got RW lock on inp w/o alternates. Can tcp_tw_2msl_scan skip current node
> and go to next node in V_twq_2msl list if current node locked by some
> reasson?

Interesting question indeed: It is not optimal that all simultaneous
calls to tcp_tw_2msl_scan() compete for the same oldest tcptw. The next
tcptws in the list are certainly old enough also.

Let me see if I can make a simple change that makes kernel threads
calling tcp_tw_2msl_scan() at same time to work on a different old
enough tcptws. So far, I found only solutions quite complex to implement.

--
Julien

signature.asc

Slawa Olhovchenkov

unread,
Sep 22, 2016, 5:42:39 AM9/22/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
On Thu, Sep 22, 2016 at 11:28:38AM +0200, Julien Charbon wrote:

> >>> What purpose to not skip locked tcptw in this loop?
> >>
> >> If I understand your question correctly: According to your pmcstat
> >> result, tcp_tw_2msl_scan() currently struggles with a write lock
> >> (__rw_wlock_hard) and the only write lock used tcp_tw_2msl_scan() is
> >> INP_WLOCK. No sign of contention on TW_RLOCK(V_tw_lock) currently.
> >
> > As I see in code, tcp_tw_2msl_scan got first node from V_twq_2msl and
> > need got RW lock on inp w/o alternates. Can tcp_tw_2msl_scan skip current node
> > and go to next node in V_twq_2msl list if current node locked by some
> > reasson?
>
> Interesting question indeed: It is not optimal that all simultaneous
> calls to tcp_tw_2msl_scan() compete for the same oldest tcptw. The next
> tcptws in the list are certainly old enough also.
>
> Let me see if I can make a simple change that makes kernel threads
> calling tcp_tw_2msl_scan() at same time to work on a different old
> enough tcptws. So far, I found only solutions quite complex to implement.

Simple solution is skip in each thread ncpu elemnts and skip curent
cpu number elements at start, if I understund you correctly.

Slawa Olhovchenkov

unread,
Sep 22, 2016, 5:55:01 AM9/22/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
I think same.

> These paths can indeed compete for the same INP lock, as both
> tcp_tw_2msl_scan() calls always start with the first inp found in
> twq_2msl list. But in both cases, this first inp should be quickly used
> and its lock released anyway, thus that could explain your situation it
> that the TCP stack is doing that all the time, for example:
>
> - Let say that you are running out completely and constantly of tcptw,
> and then all connections transitioning to TIME_WAIT state are competing
> with the TIME_WAIT timeout scan that tries to free all the expired
> tcptw. If the stack is doing that all the time, it can appear like
> "live" locked.
>
> This is just an hypothesis and as usual might be a red herring.
> Anyway, could you run:
>
> $ vmstat -z | head -2; vmstat -z | grep -E 'tcp|sock'

ITEM SIZE LIMIT USED FREE REQ FAIL SLEEP

socket: 864, 4192664, 18604, 25348,49276158, 0, 0
tcp_inpcb: 464, 4192664, 34226, 18702,49250593, 0, 0
tcpcb: 1040, 4192665, 18424, 18953,49250593, 0, 0
tcptw: 88, 16425, 15802, 623,14526919, 8, 0
tcpreass: 40, 32800, 15, 2285, 632381, 0, 0

In normal case tcptw is about 16425/600/900

And after `sysctl -a | grep tcp` system stuck on serial console and I am reset it.

> Ideally, once when everything is ok, and once when you have the issue
> to see the differences (if any).
>
> If it appears your are quite low in tcptw, and if you have enough
> memory, could you try increase the tcptw limit using sysctl

I think this is not eliminate stuck, just may do it less frequency

> net.inet.tcp.maxtcptw? And actually see if it improve (or not) your
> performance.

I am already play with net.inet.tcp.maxtcptw and it not affect performance.

Julien Charbon

unread,
Sep 22, 2016, 6:05:20 AM9/22/16
to Slawa Olhovchenkov, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara

Hi Slawa,
You are right, it would just be a big hint that the tcp_tw_2msl_scan()
contention hypothesis is the right one. As I see you have plenty of
memory on your server, thus could you try with:

net.inet.tcp.maxtcptw=4192665

And see what happen. Just to validate this hypothesis.

Thanks.

--
Julien

Slawa Olhovchenkov

unread,
Sep 22, 2016, 6:21:14 AM9/22/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
This is bad way for validate, with maxtcptw=16384 happened is random
and can be waited for month. After maxtcptw=4192665 I am don't know
how long need to wait for verification this hypothesis.

More frequency (may be 3-5 times per day) happening less traffic drops
(not to zero for minutes). May be this caused also by contention in
tcp_tw_2msl_scan, but fast resolved (stochastic process). By eating
CPU power nginx can't service connection and clients closed
connections and need more TIME_WAIT and can trigered
tcp_tw_2msl_scan(reuse=1). After this we can got live lock.

May be after I learning to catch and dignostic this validation is more
accurately.

Slawa Olhovchenkov

unread,
Sep 23, 2016, 3:17:35 PM9/23/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
Some more bits:

socket: 864, 4192664, 30806, 790,28524160, 0, 0
ipq: 56, 32802, 0, 1278, 1022, 0, 0
udp_inpcb: 464, 4192664, 44, 364, 14066, 0, 0
udpcb: 32, 4192750, 44, 3081, 14066, 0, 0
tcp_inpcb: 464, 4192664, 38558, 378,28476709, 0, 0
tcpcb: 1040, 4192665, 30690, 738,28476709, 0, 0
tcptw: 88, 32805, 7868, 772, 8412249, 0, 0

last pid: 49575; load averages: 2.00, 2.05, 3.75 up 1+01:12:08 22:13:42
853 processes: 15 running, 769 sleeping, 35 waiting, 34 lock
CPU 0: 0.0% user, 0.0% nice, 0.0% system, 100% interrupt, 0.0% idle
CPU 1: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU 2: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU 3: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU 4: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU 5: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU 6: 0.0% user, 0.0% nice, 0.4% system, 0.0% interrupt, 99.6% idle
CPU 7: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU 8: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU 9: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU 10: 0.0% user, 0.0% nice, 0.4% system, 0.0% interrupt, 99.6% idle
CPU 11: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
Mem: 8659M Active, 8385M Inact, 107G Wired, 1325M Free
ARC: 99G Total, 88G MFU, 10G MRU, 32K Anon, 167M Header, 529M Other
Swap: 32G Total, 32G Free

PID UID PRI NICE SIZE RES STATE C TIME WCPU COMMAND
49566 0 20 0 26264K 7068K CPU10 10 0:00 0.14% top
5 0 -8 - 0K 192K l2arc_ 6 31:07 0.13% zfskern{l2arc_fe
46120 0 20 0 14608K 3868K nanslp 8 0:10 0.10% tcpw
12 0 -76 - 0K 848K WAIT 1 0:00 0.02% intr{swi0: uart


total = used + free
Max used total used total used total used total
2016-09-23 21:46:41 tcptw 32805 389 1665 0 tcpcb 27788 28557 inpcb 28177 29296 socket 27875 28740
2016-09-23 21:46:42 tcptw 32805 409 1665 0 tcpcb 27723 28557 inpcb 28132 29296 socket 27812 28740
2016-09-23 21:46:43 tcptw 32805 405 1665 0 tcpcb 27750 28557 inpcb 28155 29296 socket 27838 28740
2016-09-23 21:46:44 tcptw 32805 409 1665 0 tcpcb 27765 28557 inpcb 28174 29296 socket 27852 28740
2016-09-23 21:46:45 tcptw 32805 422 1665 0 tcpcb 27737 28557 inpcb 28159 29296 socket 27824 28740
2016-09-23 21:46:46 tcptw 32805 444 1665 0 tcpcb 27729 28557 inpcb 28173 29296 socket 27817 28740
2016-09-23 21:46:47 tcptw 32805 444 1665 0 tcpcb 27712 28557 inpcb 28156 29296 socket 27800 28740
2016-09-23 21:46:48 tcptw 32805 428 1665 0 tcpcb 27693 28557 inpcb 28121 29296 socket 27783 28740
2016-09-23 21:46:49 tcptw 32805 431 1665 0 tcpcb 27677 28557 inpcb 28108 29296 socket 27766 28740
2016-09-23 21:46:50 tcptw 32805 455 1665 0 tcpcb 27721 28557 inpcb 28176 29296 socket 27809 28740
2016-09-23 21:46:51 tcptw 32805 432 1665 0 tcpcb 27739 28557 inpcb 28171 29296 socket 27828 28740
2016-09-23 21:46:52 tcptw 32805 434 1665 0 tcpcb 27731 28557 inpcb 28165 29296 socket 27819 28740
2016-09-23 21:46:53 tcptw 32805 431 1665 0 tcpcb 27742 28557 inpcb 28173 29296 socket 27831 28740
2016-09-23 21:46:54 tcptw 32805 424 1665 0 tcpcb 27738 28557 inpcb 28162 29296 socket 27826 28740
2016-09-23 21:46:55 tcptw 32805 397 1665 0 tcpcb 27740 28557 inpcb 28137 29296 socket 27827 28740
2016-09-23 21:46:56 tcptw 32805 412 1665 0 tcpcb 27742 28557 inpcb 28154 29296 socket 27830 28740
2016-09-23 21:46:57 tcptw 32805 418 1665 0 tcpcb 27749 28557 inpcb 28167 29296 socket 27838 28740
2016-09-23 21:46:58 tcptw 32805 426 1665 0 tcpcb 27740 28557 inpcb 28166 29296 socket 27827 28740
2016-09-23 21:46:59 tcptw 32805 423 1665 0 tcpcb 27687 28557 inpcb 28110 29296 socket 27773 28740
2016-09-23 21:47:00 tcptw 32805 426 1665 0 tcpcb 27716 28557 inpcb 28142 29296 socket 27804 28740
2016-09-23 21:47:01 tcptw 32805 437 1665 0 tcpcb 27732 28557 inpcb 28169 29296 socket 27821 28740
2016-09-23 21:47:02 tcptw 32805 471 1665 0 tcpcb 27672 28557 inpcb 28143 29296 socket 27760 28740
2016-09-23 21:47:03 tcptw 32805 426 1665 0 tcpcb 27752 28557 inpcb 28178 29296 socket 27838 28740
2016-09-23 21:47:04 tcptw 32805 402 1665 0 tcpcb 27760 28557 inpcb 28162 29296 socket 27847 28740
2016-09-23 21:47:05 tcptw 32805 406 1665 0 tcpcb 27757 28557 inpcb 28163 29296 socket 27845 28740
2016-09-23 21:47:06 tcptw 32805 443 1665 0 tcpcb 27783 28557 inpcb 28226 29296 socket 27871 28740
2016-09-23 21:47:07 tcptw 32805 484 1665 0 tcpcb 27707 28557 inpcb 28191 29296 socket 27794 28740
2016-09-23 21:47:08 tcptw 32805 473 1665 0 tcpcb 27721 28557 inpcb 28194 29296 socket 27807 28740
2016-09-23 21:47:09 tcptw 32805 432 1665 0 tcpcb 27749 28557 inpcb 28181 29296 socket 27837 28740
2016-09-23 21:47:10 tcptw 32805 421 1665 0 tcpcb 27780 28557 inpcb 28201 29296 socket 27868 28740
2016-09-23 21:47:11 tcptw 32805 530 1665 0 tcpcb 27814 28557 inpcb 28344 29296 socket 27902 28740
2016-09-23 21:47:12 tcptw 32805 680 1665 0 tcpcb 27874 28557 inpcb 28554 29296 socket 27964 28740
2016-09-23 21:47:13 tcptw 32805 832 1665 0 tcpcb 27881 28557 inpcb 28713 29296 socket 27971 28740
2016-09-23 21:47:14 tcptw 32805 997 1665 0 tcpcb 27880 28557 inpcb 28877 29352 socket 27972 28740
2016-09-23 21:47:15 tcptw 32805 1155 1890 0 tcpcb 27931 28557 inpcb 29086 29560 socket 28023 28740
2016-09-23 21:47:16 tcptw 32805 1322 2250 0 tcpcb 27981 28557 inpcb 29303 29800 socket 28075 28740
2016-09-23 21:47:17 tcptw 32805 1496 2385 0 tcpcb 28065 28557 inpcb 29561 30040 socket 28159 28740
2016-09-23 21:47:18 tcptw 32805 1648 2385 0 tcpcb 28151 28557 inpcb 29799 30280 socket 28245 28740
2016-09-23 21:47:19 tcptw 32805 1790 2655 0 tcpcb 28398 28599 inpcb 30188 30672 socket 28492 28796
2016-09-23 21:47:20 tcptw 32805 1954 2655 0 tcpcb 28712 28923 inpcb 30666 31120 socket 28807 29116
2016-09-23 21:47:21 tcptw 32805 2115 3015 0 tcpcb 29061 29244 inpcb 31176 31576 socket 29156 29468
2016-09-23 21:47:22 tcptw 32805 2265 3150 0 tcpcb 29335 29538 inpcb 31600 32056 socket 29430 29704
2016-09-23 21:47:23 tcptw 32805 2424 3150 0 tcpcb 29553 29775 inpcb 31977 32440 socket 29648 29956
2016-09-23 21:47:24 tcptw 32805 2590 3375 0 tcpcb 29711 29901 inpcb 32301 32744 socket 29807 30112
2016-09-23 21:47:25 tcptw 32805 2760 3780 0 tcpcb 29794 30015 inpcb 32554 33040 socket 29891 30224
2016-09-23 21:47:26 tcptw 32805 2935 3915 0 tcpcb 29879 30111 inpcb 32814 33312 socket 29976 30292
2016-09-23 21:47:27 tcptw 32805 3109 3915 0 tcpcb 29953 30195 inpcb 33062 33584 socket 30050 30392
2016-09-23 21:47:28 tcptw 32805 3264 4140 0 tcpcb 30060 30267 inpcb 33324 33824 socket 30158 30476
2016-09-23 21:47:29 tcptw 32805 3435 4275 0 tcpcb 30137 30363 inpcb 33572 34032 socket 30235 30572
2016-09-23 21:47:30 tcptw 32805 3600 4500 0 tcpcb 30221 30489 inpcb 33821 34304 socket 30320 30644
2016-09-23 21:47:31 tcptw 32805 3775 4635 0 tcpcb 30309 30588 inpcb 34084 34576 socket 30408 30740
2016-09-23 21:47:32 tcptw 32805 3936 4770 0 tcpcb 30534 30741 inpcb 34470 34960 socket 30634 30908
2016-09-23 21:47:33 tcptw 32805 4097 4905 0 tcpcb 30744 30951 inpcb 34841 35352 socket 30844 31160
2016-09-23 21:47:34 tcptw 32805 4233 5040 0 tcpcb 31006 31176 inpcb 35239 35680 socket 31106 31372
2016-09-23 21:47:35 tcptw 32805 4366 5265 0 tcpcb 31160 31386 inpcb 35526 35920 socket 31260 31568
2016-09-23 21:47:36 tcptw 32805 4738 5535 0 tcpcb 29529 31428 inpcb 34267 36016 socket 29629 31596
2016-09-23 21:47:37 tcptw 32805 4879 5625 0 tcpcb 29506 31428 inpcb 34385 36016 socket 29607 31596
2016-09-23 21:47:38 tcptw 32805 5011 5895 0 tcpcb 29590 31428 inpcb 34601 36016 socket 29691 31596
2016-09-23 21:47:39 tcptw 32805 5130 5895 0 tcpcb 29713 31428 inpcb 34843 36016 socket 29815 31596
2016-09-23 21:47:40 tcptw 32805 5259 6165 0 tcpcb 29783 31428 inpcb 35042 36016 socket 29886 31596
2016-09-23 21:47:41 tcptw 32805 5378 6255 0 tcpcb 29606 31428 inpcb 34984 36016 socket 29709 31596
2016-09-23 21:47:42 tcptw 32805 5489 6255 0 tcpcb 29638 31428 inpcb 35127 36016 socket 29741 31596
2016-09-23 21:47:43 tcptw 32805 5629 6390 0 tcpcb 29630 31428 inpcb 35259 36016 socket 29735 31596
2016-09-23 21:47:44 tcptw 32805 5754 6660 0 tcpcb 29593 31428 inpcb 35347 36016 socket 29696 31596
2016-09-23 21:47:45 tcptw 32805 5887 6660 0 tcpcb 29606 31428 inpcb 35493 36016 socket 29709 31596
2016-09-23 21:47:46 tcptw 32805 6011 6750 0 tcpcb 29613 31428 inpcb 35624 36016 socket 29716 31596
2016-09-23 21:47:47 tcptw 32805 6128 7020 0 tcpcb 29642 31428 inpcb 35770 36128 socket 29745 31596
2016-09-23 21:47:48 tcptw 32805 6250 7020 0 tcpcb 29742 31428 inpcb 35992 36416 socket 29845 31596
2016-09-23 21:47:49 tcptw 32805 6378 7155 0 tcpcb 29745 31428 inpcb 36123 36472 socket 29850 31596
2016-09-23 21:47:50 tcptw 32805 6486 7290 0 tcpcb 29756 31428 inpcb 36242 36648 socket 29861 31596
2016-09-23 21:47:51 tcptw 32805 6603 7515 0 tcpcb 29807 31428 inpcb 36410 36792 socket 29912 31596
2016-09-23 21:47:52 tcptw 32805 6736 7515 0 tcpcb 29830 31428 inpcb 36566 36912 socket 29935 31596
2016-09-23 21:47:53 tcptw 32805 6852 7785 0 tcpcb 29892 31428 inpcb 36744 37112 socket 29996 31596
2016-09-23 21:47:54 tcptw 32805 6991 7785 0 tcpcb 29876 31428 inpcb 36867 37288 socket 29981 31596
2016-09-23 21:47:55 tcptw 32805 7102 8010 0 tcpcb 29928 31428 inpcb 37030 37400 socket 30033 31596
2016-09-23 21:47:56 tcptw 32805 7227 8010 0 tcpcb 29960 31428 inpcb 37187 37544 socket 30065 31596
2016-09-23 21:47:57 tcptw 32805 7356 8280 0 tcpcb 30031 31428 inpcb 37387 37752 socket 30136 31596
2016-09-23 21:47:58 tcptw 32805 7505 8415 0 tcpcb 30164 31428 inpcb 37669 38040 socket 30270 31596
2016-09-23 21:47:59 tcptw 32805 7618 8415 0 tcpcb 30302 31428 inpcb 37920 38328 socket 30408 31596
2016-09-23 21:48:00 tcptw 32805 7740 8505 0 tcpcb 30478 31428 inpcb 38218 38560 socket 30584 31596
2016-09-23 21:48:01 tcptw 32805 7858 8640 0 tcpcb 30663 31428 inpcb 38521 38880 socket 30769 31596

Slawa Olhovchenkov

unread,
Sep 23, 2016, 4:02:11 PM9/23/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
On Wed, Sep 21, 2016 at 11:25:18PM +0200, Julien Charbon wrote:

>
My current hypothesis:

nginx do write() (or may be close()?) to socket, kernel lock
first inp in V_twq_2msl, happen callout for pfslowtimo() on the same
CPU core and tcp_tw_2msl_scan infinity locked on same inp.

In this case you modification can't help, before next try we need some
like yeld().

Slawa Olhovchenkov

unread,
Sep 25, 2016, 8:47:06 AM9/25/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
Or may be locks leaks.
Or both.

Slawa Olhovchenkov

unread,
Sep 25, 2016, 2:58:47 PM9/25/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
I can collect some more info.
First:

1046 - - R 125:10.79 nginx: worker process (nginx)
1047 kqread - I 150:11.98 nginx: worker process (nginx)
1049 kqread - I 145:58.35 nginx: worker process (nginx)
1050 kqread - I 136:33.36 nginx: worker process (nginx)
1051 kqread - I 140:59.73 nginx: worker process (nginx)
1052 kqread - I 137:18.12 nginx: worker process (nginx)

pid 1046 is nginx running on CPU0 (affinity mask set).

# procstat -k -k 1046
PID TID COMM TDNAME KSTACK
1046 100686 nginx - mi_switch+0xd2 critical_exit+0x7e lapic_handle_timer+0xb1 Xtimerint+0x8c __mtx_lock_sleep+0x168 zone_fetch_slab+0x47 zone_import+0x52 zone_alloc_item+0x36 keg_alloc_slab+0x63 keg_fetch_slab+0x16e zone_fetch_slab+0x6e zone_import+0x52 uma_zalloc_arg+0x36e m_getm2+0x14f m_uiotombuf+0x64 sosend_generic+0x356 soo_write+0x42 dofilewrite+0x87

Tracing command nginx pid 1046 tid 100686 td 0xfffff8014485f500
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe20216992a0 /usr/src/sys/kern/sched_ule.c:1973
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe20216992d0 /usr/src/sys/kern/kern_synch.c:465
critical_exit() at 0xffffffff804a6bee = critical_exit+0x7e/frame 0xfffffe20216992f0 /usr/src/sys/kern/kern_switch.c:219
lapic_handle_timer() at 0xffffffff80771701 = lapic_handle_timer+0xb1/frame 0xfffffe2021699330 /usr/src/sys/x86/x86/local_apic.c:1185
Xtimerint() at 0xffffffff806cbbcc = Xtimerint+0x8c/frame 0xfffffe2021699330 /usr/src/sys/amd64/amd64/apic_vector.S:135
--- interrupt, rip = 0xffffffff804de424, rsp = 0xfffffe2021699400, rbp = 0xfffffe2021699420 ---
lock_delay() at 0xffffffff804de424 = lock_delay+0x54/frame 0xfffffe2021699420 /usr/src/sys/kern/subr_lock.c:127
__mtx_lock_sleep() at 0xffffffff80484dc8 = __mtx_lock_sleep+0x168/frame 0xfffffe20216994a0 /usr/src/sys/kern/kern_mutex.c:512
zone_fetch_slab() at 0xffffffff806a4257 = zone_fetch_slab+0x47/frame 0xfffffe20216994e0 /usr/src/sys/vm/uma_core.c:2378
zone_import() at 0xffffffff806a4312 = zone_import+0x52/frame 0xfffffe2021699530 /usr/src/sys/vm/uma_core.c:2501
zone_alloc_item() at 0xffffffff806a0986 = zone_alloc_item+0x36/frame 0xfffffe2021699570 /usr/src/sys/vm/uma_core.c:2591
keg_alloc_slab() at 0xffffffff806a2463 = keg_alloc_slab+0x63/frame 0xfffffe20216995d0 /usr/src/sys/vm/uma_core.c:965
keg_fetch_slab() at 0xffffffff806a48ce = keg_fetch_slab+0x16e/frame 0xfffffe2021699620 /usr/src/sys/vm/uma_core.c:2349
zone_fetch_slab() at 0xffffffff806a427e = zone_fetch_slab+0x6e/frame 0xfffffe2021699660 /usr/src/sys/vm/uma_core.c:2375
zone_import() at 0xffffffff806a4312 = zone_import+0x52/frame 0xfffffe20216996b0 /usr/src/sys/vm/uma_core.c:2501
uma_zalloc_arg() at 0xffffffff806a147e = uma_zalloc_arg+0x36e/frame 0xfffffe2021699720 /usr/src/sys/vm/uma_core.c:2531
m_getm2() at 0xffffffff8048231f = m_getm2+0x14f/frame 0xfffffe2021699790 /usr/src/sys/kern/kern_mbuf.c:830
m_uiotombuf() at 0xffffffff80516044 = m_uiotombuf+0x64/frame 0xfffffe20216997e0 /usr/src/sys/kern/uipc_mbuf.c:1535
sosend_generic() at 0xffffffff8051ce56 = sosend_generic+0x356/frame 0xfffffe20216998a0
soo_write() at 0xffffffff804fd872 = soo_write+0x42/frame 0xfffffe20216998d0
dofilewrite() at 0xffffffff804f5c97 = dofilewrite+0x87/frame 0xfffffe2021699920
kern_writev() at 0xffffffff804f5978 = kern_writev+0x68/frame 0xfffffe2021699970
sys_writev() at 0xffffffff804f5be6 = sys_writev+0x36/frame 0xfffffe20216999a0
amd64_syscall() at 0xffffffff806e4051 = amd64_syscall+0x2c1/frame 0xfffffe2021699ab0
Xfast_syscall() at 0xffffffff806cb2bb = Xfast_syscall+0xfb/frame 0xfffffe2021699ab0
--- syscall (121, FreeBSD ELF64, sys_writev), rip = 0x8019cc6ba, rsp = 0x7fffffffd688, rbp = 0x7fffffffd6c0 ---

And I am now collect and save kernel dump.

# vmstat -M /var/crash/vmcore.1 -z| grep -i mbuf
mbuf_packet: 256, 52276395, 26987, 189713,705271653, 0, 0
mbuf: 256, 52276395, 61994, 782439,46250344524, 0, 0
mbuf_cluster: 2048, 1048576, 49207, 34461, 293307, 0, 0
mbuf_jumbo_page: 4096, 4084093, 61839, 613537,8265099417, 0, 0
mbuf_jumbo_9k: 9216, 3630303, 0, 0, 0, 0, 0
mbuf_jumbo_16k: 16384, 2722728, 0, 0, 0, 0, 0

Julien Charbon

unread,
Sep 26, 2016, 3:12:57 AM9/26/16
to Slawa Olhovchenkov, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara

Hi Slawa,
So my hypothesis was wrong it is not a contention between
pfslowtimo()/tcp_tw_2msl_scan(reuse=0) and
tcp_input()/tcp_twstart()/tcp_tw_2msl_scan(reuse=1) as you still have
plenty of unused tcptw.

It is an issue between tcp_tw_2msl_scan(reuse=0) and something else.

--
Julien


signature.asc

Julien Charbon

unread,
Sep 26, 2016, 4:52:00 AM9/26/16
to Slawa Olhovchenkov, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
This call stack is quite interesting:

1: A process is calling writev()
2: Kernel calls sosend_generic() that starts allocating memory
3: This allocation is then interrupted by the timer interrupt handler
[that could actually trigger tcp_tw_2msl_scan(reuse=0)]
4: The timer interrupt handler seems to wait on sched_switch()

And fun fact: When sosend_generic() calls m_uiotombuf() it does not
hold INP_WLOCK yet...

--
Julien

signature.asc

Julien Charbon

unread,
Sep 26, 2016, 5:34:12 AM9/26/16
to Slawa Olhovchenkov, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara

Hi Slawa,
You are totally right, pfslowtimo()/tcp_tw_2msl_scan(reuse=0) is
infinitely blocked on INP_WLOCK() by "something" (that could be related
to write()).

As I reached my limit of debugging without WITNESS, could you share
your /etc/sysctl.conf, /boot/loader.conf files? And any specific
configuration you have (like having a Nginx workers affinity, Nginx
special options, etc.). Like that I can try to reproduce it on releng/11.0.

Thanks.

--
Julien

signature.asc

Julien Charbon

unread,
Sep 26, 2016, 7:57:44 AM9/26/16
to Slawa Olhovchenkov, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara

Hi Slawa,
Actually one extra debug thing you can do is launching lockstat with
below extra options:

-H For Hold lock stats
-P To get the overall time
-s 20 To get the stackstrace

To see who is holding the INP lock for so long. Thanks to Hiren for
pointing the -H option.

--
Julien

Slawa Olhovchenkov

unread,
Sep 26, 2016, 8:05:27 AM9/26/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
No, this is more interesting: double call (recuersion) to zone_import()!

> And fun fact: When sosend_generic() calls m_uiotombuf() it does not
> hold INP_WLOCK yet...

Yes, is not INP_WLOCK related, this is like next error.

Slawa Olhovchenkov

unread,
Sep 26, 2016, 8:13:14 AM9/26/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
On Mon, Sep 26, 2016 at 11:33:12AM +0200, Julien Charbon wrote:

> >>> - tcp_input()/tcp_twstart()/tcp_tw_2msl_scan(reuse=1)
> >>
> >> My current hypothesis:
> >>
> >> nginx do write() (or may be close()?) to socket, kernel lock
> >> first inp in V_twq_2msl, happen callout for pfslowtimo() on the same
> >> CPU core and tcp_tw_2msl_scan infinity locked on same inp.
> >>
> >> In this case you modification can't help, before next try we need some
> >> like yeld().
> >
> > Or may be locks leaks.
> > Or both.
>
> You are totally right, pfslowtimo()/tcp_tw_2msl_scan(reuse=0) is
> infinitely blocked on INP_WLOCK() by "something" (that could be related
> to write()).
>
> As I reached my limit of debugging without WITNESS, could you share
> your /etc/sysctl.conf, /boot/loader.conf files? And any specific
> configuration you have (like having a Nginx workers affinity, Nginx
> special options, etc.). Like that I can try to reproduce it on releng/11.0.

I am use double socket server, E5-2620.
Double Intel 10G NIC, affinity to CPU 6..11.
Nginx affinity to CPU 0..5.
I.e. on CPU 0 only nginx worker affinity exist and NIC IRQ handler
activity only on CPU 6..11.

/boot/loader.conf:

kern.geom.label.gptid.enable="0"
zfs_load="YES"
#### generated by conf.pl #########
hw.memtest.tests=0
machdep.hyperthreading_allowed=0
kern.geom.label.disk_ident.enable=0
if_igb_load=yes
if_ix_load=yes
hw.ix.num_queues=3
hw.ix.rxd=4096
hw.ix.txd=4096
hw.ix.rx_process_limit=-1
hw.ix.tx_process_limit=-1
if_lagg_load=YES
net.link.lagg.default_use_flowid=0
accf_http_load=yes
aio_load=yes
cc_htcp_load=yes
kern.ipc.nmbclusters=1048576
net.inet.tcp.reass.maxsegments=32768
net.inet.tcp.hostcache.cachelimit=0
net.inet.tcp.hostcache.hashsize=32768
net.inet.tcp.syncache.hashsize=32768
#net.inet.tcp.tcbhashsize=262144
net.inet.tcp.tcbhashsize=65536
net.inet.tcp.maxtcptw=16384
kern.pin_default_swi=1
kern.pin_pcpu_swi=1
kern.hwpmc.nbuffers=131072
hw.cxgbe.qsize_rxq=16384
hw.cxgbe.qsize_txq=16384
hw.cxgbe.nrxq10g=3
kernel="kernel.VSTREAM"
kernels="kernel"
hw.mps.max_chains=3072
###
hw.vga.textmode=1
uhci_load=yes
ohci_load=yes
ehci_load=yes
xhci_load=yes
ukbd_load=yes
umass_load=yes
###
boot_multicons="YES"
boot_serial="YES"
comconsole_speed="115200"
comconsole_port=760
#console="comconsole,vidconsole"
console="vidconsole,comconsole"
hint.uart.0.flags="0x00"
hint.uart.1.flags="0x10"

/etc/sysctl.conf:

kern.random.sys.harvest.ethernet=0
kern.threads.max_threads_per_proc=20000
net.inet.ip.maxfragpackets=32768
net.inet.ip.fastforwarding=1
kern.ipc.somaxconn=4096
kern.ipc.nmbjumbop=2097152
kern.ipc.maxsockbuf=16777216
net.inet.tcp.sendbuf_max=16777216
net.inet.tcp.recvbuf_max=16777216
net.inet.tcp.sendbuf_inc=16384
net.inet.tcp.sendspace=2097152
#net.inet.tcp.maxtcptw=444800
net.inet.tcp.fast_finwait2_recycle=1
net.inet.tcp.msl=1000
net.inet.tcp.cc.algorithm=htcp
net.inet.tcp.per_cpu_timers=1
#net.inet.tcp.syncookies=0
net.inet6.ip6.auto_linklocal=0
kern.maxfiles=300000
kern.maxfilesperproc=80000
#hw.intr_storm_threshold=9000
vfs.zfs.prefetch_disable=1
vfs.zfs.vdev.max_pending=1000
vfs.zfs.l2arc_noprefetch=0
vfs.zfs.l2arc_norw=0
vfs.zfs.l2arc_write_boost=134217728
vfs.zfs.l2arc_write_max=33554432
vfs.aio.max_aio_procs=512
vfs.aio.max_aio_queue_per_proc=8192
vfs.aio.max_aio_per_proc=8192
vfs.aio.max_aio_queue=65536
net.inet.tcp.finwait2_timeout=5000

kern.corefile=/tmp/%N.%P.core
kern.sugid_coredump=1

Now (after this nginx lockout) I am use you patch witch modification:
act return NULL at write lock and now see only mbuf-related work.

Slawa Olhovchenkov

unread,
Sep 26, 2016, 8:39:27 AM9/26/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
At time of this graph I am collect output from `lockstat -PH -s 5
sleep 1` too and don't see any interesting -- I am think lock holded
before lockstat run don't detected and don't showed.

I still can show collected output, if you need. hundreds of lines.

Slawa Olhovchenkov

unread,
Sep 26, 2016, 1:22:32 PM9/26/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
On Mon, Sep 26, 2016 at 11:33:12AM +0200, Julien Charbon wrote:

>
Some more traces from ddb:

Tracing command intr pid 12 tid 100103 td 0xfffff8012508ea00
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe2020ea8330
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe2020ea8360
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe2020ea83a0
__mtx_lock_sleep() at 0xffffffff80484d9d = __mtx_lock_sleep+0x13d/frame 0xfffffe2020ea8420
sodealloc() at 0xffffffff8051b992 = sodealloc+0x32/frame 0xfffffe2020ea8460
tcp_close() at 0xffffffff80618150 = tcp_close+0xd0/frame 0xfffffe2020ea8490
tcp_do_segment() at 0xffffffff80610226 = tcp_do_segment+0x1666/frame 0xfffffe2020ea8590
tcp_input() at 0xffffffff8060e17c = tcp_input+0xe1c/frame 0xfffffe2020ea86e0
ip_input() at 0xffffffff805a087f = ip_input+0x15f/frame 0xfffffe2020ea8740
netisr_dispatch_src() at 0xffffffff80583db5 = netisr_dispatch_src+0xa5/frame 0xfffffe2020ea87a0
ether_demux() at 0xffffffff80575b3a = ether_demux+0x12a/frame 0xfffffe2020ea87d0
ether_nh_input() at 0xffffffff80576792 = ether_nh_input+0x322/frame 0xfffffe2020ea8830
netisr_dispatch_src() at 0xffffffff80583db5 = netisr_dispatch_src+0xa5/frame 0xfffffe2020ea8890
ether_input() at 0xffffffff80575db6 = ether_input+0x26/frame 0xfffffe2020ea88b0
ixgbe_rxeof() at 0xffffffff813df36b = ixgbe_rxeof+0x7ab/frame 0xfffffe2020ea8990
ixgbe_msix_que() at 0xffffffff813da57c = ixgbe_msix_que+0x8c/frame 0xfffffe2020ea89e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe2020ea8a20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe2020ea8a70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe2020ea8ab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe2020ea8ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100105 td 0xfffff8012508e000
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe2020eb2330
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe2020eb2360
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe2020eb23a0
__mtx_lock_sleep() at 0xffffffff80484d9d = __mtx_lock_sleep+0x13d/frame 0xfffffe2020eb2420
sodealloc() at 0xffffffff8051b992 = sodealloc+0x32/frame 0xfffffe2020eb2460
tcp_close() at 0xffffffff80618150 = tcp_close+0xd0/frame 0xfffffe2020eb2490
tcp_do_segment() at 0xffffffff80610226 = tcp_do_segment+0x1666/frame 0xfffffe2020eb2590
tcp_input() at 0xffffffff8060e17c = tcp_input+0xe1c/frame 0xfffffe2020eb26e0
ip_input() at 0xffffffff805a087f = ip_input+0x15f/frame 0xfffffe2020eb2740
netisr_dispatch_src() at 0xffffffff80583db5 = netisr_dispatch_src+0xa5/frame 0xfffffe2020eb27a0
ether_demux() at 0xffffffff80575b3a = ether_demux+0x12a/frame 0xfffffe2020eb27d0
ether_nh_input() at 0xffffffff80576792 = ether_nh_input+0x322/frame 0xfffffe2020eb2830
netisr_dispatch_src() at 0xffffffff80583db5 = netisr_dispatch_src+0xa5/frame 0xfffffe2020eb2890
ether_input() at 0xffffffff80575db6 = ether_input+0x26/frame 0xfffffe2020eb28b0
ixgbe_rxeof() at 0xffffffff813df36b = ixgbe_rxeof+0x7ab/frame 0xfffffe2020eb2990
ixgbe_msix_que() at 0xffffffff813da57c = ixgbe_msix_que+0x8c/frame 0xfffffe2020eb29e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe2020eb2a20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe2020eb2a70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe2020eb2ab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe2020eb2ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100107 td 0xfffff8012508d500
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe2020ebc2b0
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe2020ebc2e0
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe2020ebc320
__mtx_lock_sleep() at 0xffffffff80484d9d = __mtx_lock_sleep+0x13d/frame 0xfffffe2020ebc3a0
soalloc() at 0xffffffff8051b914 = soalloc+0x1b4/frame 0xfffffe2020ebc3f0
sonewconn() at 0xffffffff8051bb9f = sonewconn+0xbf/frame 0xfffffe2020ebc430
syncache_expand() at 0xffffffff8061b85b = syncache_expand+0x78b/frame 0xfffffe2020ebc590
tcp_input() at 0xffffffff8060e10e = tcp_input+0xdae/frame 0xfffffe2020ebc6e0
ip_input() at 0xffffffff805a087f = ip_input+0x15f/frame 0xfffffe2020ebc740
netisr_dispatch_src() at 0xffffffff80583db5 = netisr_dispatch_src+0xa5/frame 0xfffffe2020ebc7a0
ether_demux() at 0xffffffff80575b3a = ether_demux+0x12a/frame 0xfffffe2020ebc7d0
ether_nh_input() at 0xffffffff80576792 = ether_nh_input+0x322/frame 0xfffffe2020ebc830
netisr_dispatch_src() at 0xffffffff80583db5 = netisr_dispatch_src+0xa5/frame 0xfffffe2020ebc890
ether_input() at 0xffffffff80575db6 = ether_input+0x26/frame 0xfffffe2020ebc8b0
ixgbe_rxeof() at 0xffffffff813df36b = ixgbe_rxeof+0x7ab/frame 0xfffffe2020ebc990
ixgbe_msix_que() at 0xffffffff813da57c = ixgbe_msix_que+0x8c/frame 0xfffffe2020ebc9e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe2020ebca20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe2020ebca70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe2020ebcab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe2020ebcab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100111 td 0xfffff801250a2000
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe2020f302f0
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe2020f30320
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe2020f30360
__mtx_lock_sleep() at 0xffffffff80484d9d = __mtx_lock_sleep+0x13d/frame 0xfffffe2020f303e0
sodealloc() at 0xffffffff8051b992 = sodealloc+0x32/frame 0xfffffe2020f30420
tcp_close() at 0xffffffff80618150 = tcp_close+0xd0/frame 0xfffffe2020f30450
tcp_twstart() at 0xffffffff8061f0e7 = tcp_twstart+0x2b7/frame 0xfffffe2020f30490
tcp_do_segment() at 0xffffffff80610bdf = tcp_do_segment+0x201f/frame 0xfffffe2020f30590
tcp_input() at 0xffffffff8060e17c = tcp_input+0xe1c/frame 0xfffffe2020f306e0
ip_input() at 0xffffffff805a087f = ip_input+0x15f/frame 0xfffffe2020f30740
netisr_dispatch_src() at 0xffffffff80583db5 = netisr_dispatch_src+0xa5/frame 0xfffffe2020f307a0
ether_demux() at 0xffffffff80575b3a = ether_demux+0x12a/frame 0xfffffe2020f307d0
ether_nh_input() at 0xffffffff80576792 = ether_nh_input+0x322/frame 0xfffffe2020f30830
netisr_dispatch_src() at 0xffffffff80583db5 = netisr_dispatch_src+0xa5/frame 0xfffffe2020f30890
ether_input() at 0xffffffff80575db6 = ether_input+0x26/frame 0xfffffe2020f308b0
ixgbe_rxeof() at 0xffffffff813df36b = ixgbe_rxeof+0x7ab/frame 0xfffffe2020f30990
ixgbe_msix_que() at 0xffffffff813da57c = ixgbe_msix_que+0x8c/frame 0xfffffe2020f309e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe2020f30a20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe2020f30a70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe2020f30ab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe2020f30ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100113 td 0xfffff801250a1500
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe2020f3a2f0
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe2020f3a320
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe2020f3a360
__mtx_lock_sleep() at 0xffffffff80484d9d = __mtx_lock_sleep+0x13d/frame 0xfffffe2020f3a3e0
sodealloc() at 0xffffffff8051b992 = sodealloc+0x32/frame 0xfffffe2020f3a420
tcp_close() at 0xffffffff80618150 = tcp_close+0xd0/frame 0xfffffe2020f3a450
tcp_twstart() at 0xffffffff8061f0e7 = tcp_twstart+0x2b7/frame 0xfffffe2020f3a490
tcp_do_segment() at 0xffffffff80610bdf = tcp_do_segment+0x201f/frame 0xfffffe2020f3a590
tcp_input() at 0xffffffff8060e17c = tcp_input+0xe1c/frame 0xfffffe2020f3a6e0
ip_input() at 0xffffffff805a087f = ip_input+0x15f/frame 0xfffffe2020f3a740
netisr_dispatch_src() at 0xffffffff80583db5 = netisr_dispatch_src+0xa5/frame 0xfffffe2020f3a7a0
ether_demux() at 0xffffffff80575b3a = ether_demux+0x12a/frame 0xfffffe2020f3a7d0
ether_nh_input() at 0xffffffff80576792 = ether_nh_input+0x322/frame 0xfffffe2020f3a830
netisr_dispatch_src() at 0xffffffff80583db5 = netisr_dispatch_src+0xa5/frame 0xfffffe2020f3a890
ether_input() at 0xffffffff80575db6 = ether_input+0x26/frame 0xfffffe2020f3a8b0
ixgbe_rxeof() at 0xffffffff813df36b = ixgbe_rxeof+0x7ab/frame 0xfffffe2020f3a990
ixgbe_msix_que() at 0xffffffff813da57c = ixgbe_msix_que+0x8c/frame 0xfffffe2020f3a9e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe2020f3aa20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe2020f3aa70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe2020f3aab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe2020f3aab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100115 td 0xfffff801250a0a00
cpustop_handler() at 0xffffffff80775998 = cpustop_handler+0x28/frame 0xfffffe1f9e182cf0
ipi_nmi_handler() at 0xffffffff8077595a = ipi_nmi_handler+0x4a/frame 0xfffffe1f9e182d10
trap() at 0xffffffff806e2e4a = trap+0x3a/frame 0xfffffe1f9e182f20
nmi_calltrap() at 0xffffffff806cb413 = nmi_calltrap+0x8/frame 0xfffffe1f9e182f20
--- trap 0x13, rip = 0xffffffff8059b9f9, rsp = 0xfffffe2020f44420, rbp = 0xfffffe2020f44420 ---
in_pcbref() at 0xffffffff8059b9f9 = in_pcbref+0x9/frame 0xfffffe2020f44420
tcp_tw_2msl_scan() at 0xffffffff8061f1a3 = tcp_tw_2msl_scan+0x73/frame 0xfffffe2020f44450
tcp_twstart() at 0xffffffff8061eed3 = tcp_twstart+0xa3/frame 0xfffffe2020f44490
tcp_do_segment() at 0xffffffff80610bdf = tcp_do_segment+0x201f/frame 0xfffffe2020f44590
tcp_input() at 0xffffffff8060e17c = tcp_input+0xe1c/frame 0xfffffe2020f446e0
ip_input() at 0xffffffff805a087f = ip_input+0x15f/frame 0xfffffe2020f44740
netisr_dispatch_src() at 0xffffffff80583db5 = netisr_dispatch_src+0xa5/frame 0xfffffe2020f447a0
ether_demux() at 0xffffffff80575b3a = ether_demux+0x12a/frame 0xfffffe2020f447d0
ether_nh_input() at 0xffffffff80576792 = ether_nh_input+0x322/frame 0xfffffe2020f44830
netisr_dispatch_src() at 0xffffffff80583db5 = netisr_dispatch_src+0xa5/frame 0xfffffe2020f44890
ether_input() at 0xffffffff80575db6 = ether_input+0x26/frame 0xfffffe2020f448b0
ixgbe_rxeof() at 0xffffffff813df36b = ixgbe_rxeof+0x7ab/frame 0xfffffe2020f44990
ixgbe_msix_que() at 0xffffffff813da57c = ixgbe_msix_que+0x8c/frame 0xfffffe2020f449e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe2020f44a20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe2020f44a70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe2020f44ab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe2020f44ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command ps pid 37011 tid 101992 td 0xfffff80144378000
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe20224367b0
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe20224367e0
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe2022436820
__mtx_lock_sleep() at 0xffffffff80484d9d = __mtx_lock_sleep+0x13d/frame 0xfffffe20224368a0
soalloc() at 0xffffffff8051b914 = soalloc+0x1b4/frame 0xfffffe20224368f0
socreate() at 0xffffffff8051b617 = socreate+0xa7/frame 0xfffffe2022436940
sys_socket() at 0xffffffff8052144d = sys_socket+0xed/frame 0xfffffe20224369a0
amd64_syscall() at 0xffffffff806e4051 = amd64_syscall+0x2c1/frame 0xfffffe2022436ab0
Xfast_syscall() at 0xffffffff806cb2bb = Xfast_syscall+0xfb/frame 0xfffffe2022436ab0
--- syscall (97, FreeBSD ELF64, sys_socket), rip = 0x8011c413a, rsp = 0x7fffffffc748, rbp = 0x7fffffffc770 ---

Tracing command cron pid 37008 tid 102228 td 0xfffff801a4090000
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe20228d67b0
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe20228d67e0
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe20228d6820
__mtx_lock_sleep() at 0xffffffff80484d9d = __mtx_lock_sleep+0x13d/frame 0xfffffe20228d68a0
soalloc() at 0xffffffff8051b914 = soalloc+0x1b4/frame 0xfffffe20228d68f0
socreate() at 0xffffffff8051b617 = socreate+0xa7/frame 0xfffffe20228d6940
sys_socket() at 0xffffffff8052144d = sys_socket+0xed/frame 0xfffffe20228d69a0
amd64_syscall() at 0xffffffff806e4051 = amd64_syscall+0x2c1/frame 0xfffffe20228d6ab0
Xfast_syscall() at 0xffffffff806cb2bb = Xfast_syscall+0xfb/frame 0xfffffe20228d6ab0
--- syscall (97, FreeBSD ELF64, sys_socket), rip = 0x800d8c13a, rsp = 0x7fffffffd658, rbp = 0x7fffffffd6f0 ---

[many likes]

Tracing command intr pid 12 tid 100015 td 0xfffff8011422b000
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe1f9e1cf760
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe1f9e1cf790
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe1f9e1cf7d0
__mtx_lock_sleep() at 0xffffffff80484d9d = __mtx_lock_sleep+0x13d/frame 0xfffffe1f9e1cf850
sodealloc() at 0xffffffff8051b992 = sodealloc+0x32/frame 0xfffffe1f9e1cf890
tcp_close() at 0xffffffff80618150 = tcp_close+0xd0/frame 0xfffffe1f9e1cf8c0
tcp_timer_2msl() at 0xffffffff8061dda3 = tcp_timer_2msl+0x1f3/frame 0xfffffe1f9e1cf8f0
softclock_call_cc() at 0xffffffff804b4ca9 = softclock_call_cc+0x179/frame 0xfffffe1f9e1cf9c0
softclock() at 0xffffffff804b5034 = softclock+0x44/frame 0xfffffe1f9e1cf9e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe1f9e1cfa20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe1f9e1cfa70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe1f9e1cfab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe1f9e1cfab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100016 td 0xfffff8011422aa00
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe1f9e1d4760
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe1f9e1d4790
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe1f9e1d47d0
__mtx_lock_sleep() at 0xffffffff80484d9d = __mtx_lock_sleep+0x13d/frame 0xfffffe1f9e1d4850
sodealloc() at 0xffffffff8051b992 = sodealloc+0x32/frame 0xfffffe1f9e1d4890
tcp_close() at 0xffffffff80618150 = tcp_close+0xd0/frame 0xfffffe1f9e1d48c0
tcp_timer_2msl() at 0xffffffff8061dda3 = tcp_timer_2msl+0x1f3/frame 0xfffffe1f9e1d48f0
softclock_call_cc() at 0xffffffff804b4ca9 = softclock_call_cc+0x179/frame 0xfffffe1f9e1d49c0
softclock() at 0xffffffff804b5034 = softclock+0x44/frame 0xfffffe1f9e1d49e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe1f9e1d4a20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe1f9e1d4a70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe1f9e1d4ab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe1f9e1d4ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100017 td 0xfffff8011422a500
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe1f9e1d9760
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe1f9e1d9790
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe1f9e1d97d0
__mtx_lock_sleep() at 0xffffffff80484d9d = __mtx_lock_sleep+0x13d/frame 0xfffffe1f9e1d9850
sodealloc() at 0xffffffff8051b992 = sodealloc+0x32/frame 0xfffffe1f9e1d9890
tcp_close() at 0xffffffff80618150 = tcp_close+0xd0/frame 0xfffffe1f9e1d98c0
tcp_timer_2msl() at 0xffffffff8061dda3 = tcp_timer_2msl+0x1f3/frame 0xfffffe1f9e1d98f0
softclock_call_cc() at 0xffffffff804b4ca9 = softclock_call_cc+0x179/frame 0xfffffe1f9e1d99c0
softclock() at 0xffffffff804b5034 = softclock+0x44/frame 0xfffffe1f9e1d99e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe1f9e1d9a20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe1f9e1d9a70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe1f9e1d9ab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe1f9e1d9ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100018 td 0xfffff8011422a000
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe1f9e1de760
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe1f9e1de790
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe1f9e1de7d0
__mtx_lock_sleep() at 0xffffffff80484d9d = __mtx_lock_sleep+0x13d/frame 0xfffffe1f9e1de850
sodealloc() at 0xffffffff8051b992 = sodealloc+0x32/frame 0xfffffe1f9e1de890
tcp_close() at 0xffffffff80618150 = tcp_close+0xd0/frame 0xfffffe1f9e1de8c0
tcp_timer_2msl() at 0xffffffff8061dda3 = tcp_timer_2msl+0x1f3/frame 0xfffffe1f9e1de8f0
softclock_call_cc() at 0xffffffff804b4ca9 = softclock_call_cc+0x179/frame 0xfffffe1f9e1de9c0
softclock() at 0xffffffff804b5034 = softclock+0x44/frame 0xfffffe1f9e1de9e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe1f9e1dea20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe1f9e1dea70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe1f9e1deab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe1f9e1deab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100019 td 0xfffff8011424da00
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe1f9e1e3760
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe1f9e1e3790
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe1f9e1e37d0
__mtx_lock_sleep() at 0xffffffff80484d9d = __mtx_lock_sleep+0x13d/frame 0xfffffe1f9e1e3850
sodealloc() at 0xffffffff8051b992 = sodealloc+0x32/frame 0xfffffe1f9e1e3890
tcp_close() at 0xffffffff80618150 = tcp_close+0xd0/frame 0xfffffe1f9e1e38c0
tcp_timer_2msl() at 0xffffffff8061dda3 = tcp_timer_2msl+0x1f3/frame 0xfffffe1f9e1e38f0
softclock_call_cc() at 0xffffffff804b4ca9 = softclock_call_cc+0x179/frame 0xfffffe1f9e1e39c0
softclock() at 0xffffffff804b5034 = softclock+0x44/frame 0xfffffe1f9e1e39e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute-- 7:zsh -- time-stamp -- Sep/26/16 20:00:13 --
-- 7:zsh -- time-stamp -- Sep/26/16 20:00:13 --
_handlers+0x95/frame 0xfffffe1f9e1e3a20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe1f9e1e3a70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe1f9e1e3ab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe1f9e1e3ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100020 td 0xfffff8011424d500
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe1f9e1e8760
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe1f9e1e8790
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe1f9e1e87d0
__mtx_lock_sleep() at 0xffffffff80484d9d = __mtx_lock_sleep+0x13d/frame 0xfffffe1f9e1e8850
sodealloc() at 0xffffffff8051b992 = sodealloc+0x32/frame 0xfffffe1f9e1e8890
tcp_close() at 0xffffffff80618150 = tcp_close+0xd0/frame 0xfffffe1f9e1e88c0
tcp_timer_2msl() at 0xffffffff8061dda3 = tcp_timer_2msl+0x1f3/frame 0xfffffe1f9e1e88f0
softclock_call_cc() at 0xffffffff804b4ca9 = softclock_call_cc+0x179/frame 0xfffffe1f9e1e89c0
softclock() at 0xffffffff804b5034 = softclock+0x44/frame 0xfffffe1f9e1e89e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe1f9e1e8a20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe1f9e1e8a70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe1f9e1e8ab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe1f9e1e8ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100021 td 0xfffff8011424d000
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe1f9e1ed760
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe1f9e1ed790
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe1f9e1ed7d0
__mtx_lock_sleep() at 0xffffffff80484d9d = __mtx_lock_sleep+0x13d/frame 0xfffffe1f9e1ed850
sodealloc() at 0xffffffff8051b992 = sodealloc+0x32/frame 0xfffffe1f9e1ed890
tcp_close() at 0xffffffff80618150 = tcp_close+0xd0/frame 0xfffffe1f9e1ed8c0
tcp_timer_2msl() at 0xffffffff8061dda3 = tcp_timer_2msl+0x1f3/frame 0xfffffe1f9e1ed8f0
softclock_call_cc() at 0xffffffff804b4ca9 = softclock_call_cc+0x179/frame 0xfffffe1f9e1ed9c0
softclock() at 0xffffffff804b5034 = softclock+0x44/frame 0xfffffe1f9e1ed9e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe1f9e1eda20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe1f9e1eda70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe1f9e1edab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe1f9e1edab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100022 td 0xfffff8011424ca00
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe1f9e1f2760
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe1f9e1f2790
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe1f9e1f27d0
__mtx_lock_sleep() at 0xffffffff80484d9d = __mtx_lock_sleep+0x13d/frame 0xfffffe1f9e1f2850
sodealloc() at 0xffffffff8051b992 = sodealloc+0x32/frame 0xfffffe1f9e1f2890
tcp_close() at 0xffffffff80618150 = tcp_close+0xd0/frame 0xfffffe1f9e1f28c0
tcp_timer_2msl() at 0xffffffff8061dda3 = tcp_timer_2msl+0x1f3/frame 0xfffffe1f9e1f28f0
softclock_call_cc() at 0xffffffff804b4ca9 = softclock_call_cc+0x179/frame 0xfffffe1f9e1f29c0
softclock() at 0xffffffff804b5034 = softclock+0x44/frame 0xfffffe1f9e1f29e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe1f9e1f2a20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe1f9e1f2a70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe1f9e1f2ab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe1f9e1f2ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100023 td 0xfffff8011424c500
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe1f9e1f7760
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe1f9e1f7790
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe1f9e1f77d0
__mtx_lock_sleep() at 0xffffffff80484d9d = __mtx_lock_sleep+0x13d/frame 0xfffffe1f9e1f7850
sodealloc() at 0xffffffff8051b992 = sodealloc+0x32/frame 0xfffffe1f9e1f7890
tcp_close() at 0xffffffff80618150 = tcp_close+0xd0/frame 0xfffffe1f9e1f78c0
tcp_timer_rexmt() at 0xffffffff8061e454 = tcp_timer_rexmt+0x114/frame 0xfffffe1f9e1f78f0
softclock_call_cc() at 0xffffffff804b4ca9 = softclock_call_cc+0x179/frame 0xfffffe1f9e1f79c0
softclock() at 0xffffffff804b5034 = softclock+0x44/frame 0xfffffe1f9e1f79e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe1f9e1f7a20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe1f9e1f7a70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe1f9e1f7ab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe1f9e1f7ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100024 td 0xfffff8011424c000
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe1f9e1fc760
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe1f9e1fc790
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe1f9e1fc7d0
__mtx_lock_sleep() at 0xffffffff80484d9d = __mtx_lock_sleep+0x13d/frame 0xfffffe1f9e1fc850
sodealloc() at 0xffffffff8051b992 = sodealloc+0x32/frame 0xfffffe1f9e1fc890
tcp_close() at 0xffffffff80618150 = tcp_close+0xd0/frame 0xfffffe1f9e1fc8c0
tcp_timer_2msl() at 0xffffffff8061dda3 = tcp_timer_2msl+0x1f3/frame 0xfffffe1f9e1fc8f0
softclock_call_cc() at 0xffffffff804b4ca9 = softclock_call_cc+0x179/frame 0xfffffe1f9e1fc9c0
softclock() at 0xffffffff804b5034 = softclock+0x44/frame 0xfffffe1f9e1fc9e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe1f9e1fca20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe1f9e1fca70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe1f9e1fcab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe1f9e1fcab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100025 td 0xfffff8011424ba00
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe0000382760
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe0000382790
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe00003827d0
__mtx_lock_sleep() at 0xffffffff80484d9d = __mtx_lock_sleep+0x13d/frame 0xfffffe0000382850
sodealloc() at 0xffffffff8051b992 = sodealloc+0x32/frame 0xfffffe0000382890
tcp_close() at 0xffffffff80618150 = tcp_close+0xd0/frame 0xfffffe00003828c0
tcp_timer_2msl() at 0xffffffff8061dda3 = tcp_timer_2msl+0x1f3/frame 0xfffffe00003828f0
softclock_call_cc() at 0xffffffff804b4ca9 = softclock_call_cc+0x179/frame 0xfffffe00003829c0
softclock() at 0xffffffff804b5034 = softclock+0x44/frame 0xfffffe00003829e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe0000382a20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe0000382a70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe0000382ab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe0000382ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100026 td 0xfffff8011424b500
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe00003876f0
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe0000387720
critical_exit() at 0xffffffff804a6bee = critical_exit+0x7e/frame 0xfffffe0000387740
ipi_bitmap_handler() at 0xffffffff80775629 = ipi_bitmap_handler+0x79/frame 0xfffffe0000387780
Xipi_intr_bitmap_handler() at 0xffffffff806cc15e = Xipi_intr_bitmap_handler+0x8e/frame 0xfffffe0000387780
--- interrupt, rip = 0xffffffff80484c1f, rsp = 0xfffffe0000387850, rbp = 0xfffffe0000387850 ---
__mtx_lock_flags() at 0xffffffff80484c1f = __mtx_lock_flags+0x2f/frame 0xfffffe0000387850
sodealloc() at 0xffffffff8051b992 = sodealloc+0x32/frame 0xfffffe0000387890
tcp_close() at 0xffffffff80618150 = tcp_close+0xd0/frame 0xfffffe00003878c0
tcp_timer_2msl() at 0xffffffff8061dda3 = tcp_timer_2msl+0x1f3/frame 0xfffffe00003878f0
softclock_call_cc() at 0xffffffff804b4ca9 = softclock_call_cc+0x179/frame 0xfffffe00003879c0
softclock() at 0xffffffff804b5034 = softclock+0x44/frame 0xfffffe00003879e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe0000387a20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe0000387a70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe0000387ab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe0000387ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Julien Charbon

unread,
Sep 28, 2016, 6:07:39 AM9/28/16
to Slawa Olhovchenkov, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara

Hi Slawa,
Nice stack traces, all threads are blocked in sodealloc() or soalloc()
and if you look at how mtx_lock(&so_global_mtx) and
mtx_unlock(&so_global_mtx) are used, it is hard to think about a
scenario that can lead to this state.

I am still trying to reproduce your issue, without success so far.

--
Julien

signature.asc

Slawa Olhovchenkov

unread,
Sep 28, 2016, 8:41:47 AM9/28/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
On Wed, Sep 28, 2016 at 12:06:47PM +0200, Julien Charbon wrote:

> > Tracing command intr pid 12 tid 100026 td 0xfffff8011424b500
> > sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe00003876f0
> > mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe0000387720
> > critical_exit() at 0xffffffff804a6bee = critical_exit+0x7e/frame 0xfffffe0000387740
> > ipi_bitmap_handler() at 0xffffffff80775629 = ipi_bitmap_handler+0x79/frame 0xfffffe0000387780
> > Xipi_intr_bitmap_handler() at 0xffffffff806cc15e = Xipi_intr_bitmap_handler+0x8e/frame 0xfffffe0000387780
> > --- interrupt, rip = 0xffffffff80484c1f, rsp = 0xfffffe0000387850, rbp = 0xfffffe0000387850 ---
> > __mtx_lock_flags() at 0xffffffff80484c1f = __mtx_lock_flags+0x2f/frame 0xfffffe0000387850
> > sodealloc() at 0xffffffff8051b992 = sodealloc+0x32/frame 0xfffffe0000387890
> > tcp_close() at 0xffffffff80618150 = tcp_close+0xd0/frame 0xfffffe00003878c0
> > tcp_timer_2msl() at 0xffffffff8061dda3 = tcp_timer_2msl+0x1f3/frame 0xfffffe00003878f0
> > softclock_call_cc() at 0xffffffff804b4ca9 = softclock_call_cc+0x179/frame 0xfffffe00003879c0
> > softclock() at 0xffffffff804b5034 = softclock+0x44/frame 0xfffffe00003879e0
> > intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe0000387a20
> > ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe0000387a70
> > fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe0000387ab0
> > fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe0000387ab0
> > --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
>
> Nice stack traces, all threads are blocked in sodealloc() or soalloc()
> and if you look at how mtx_lock(&so_global_mtx) and
> mtx_unlock(&so_global_mtx) are used, it is hard to think about a
> scenario that can lead to this state.
>
> I am still trying to reproduce your issue, without success so far.

May be some hardware-related (low-speed CPU?).
Yesternight I am collect new stack traces and kernel dump.
May be I can see something?

db> ps
pid ppid pgrp uid state wmesg wchan cmd
12 0 0 0 RL (threaded) [intr]
100023 RunQ [swi4: clock (8)]
100107 Run CPU 8 [irq291: ix0:q2]
11 0 0 0 RL (threaded) [idle]
100011 CanRun [idle: cpu8]

cpuid = 8
dynamic pcpu = 0xfffffe201d69cf00
curthread = 0xfffff8012508d500: pid 12 "irq291: ix0:q2"
curpcb = 0xfffffe2020ebcb80
fpcurthread = none
idlethread = 0xfffff8011422c500: tid 100011 "idle: cpu8"
curpmap = 0xffffffff80d49998
tssp = 0xffffffff80d7fcd0
commontssp = 0xffffffff80d7fcd0
rsp0 = 0xfffffe2020ebcb80
gs32p = 0xffffffff80d86528
ldt = 0xffffffff80d86568
tss = 0xffffffff80d86558


Tracing command nginx pid 1061 tid 101747 td 0xfffff8014b35b500
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe2021b70330
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe2021b70360
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe2021b703a0
__rw_wlock_hard() at 0xffffffff8049c314 = __rw_wlock_hard+0x94/frame 0xfffffe2021b70430
in_lltable_lookup() at 0xffffffff80594823 = in_lltable_lookup+0x83/frame 0xfffffe2021b70450
arpresolve() at 0xffffffff8058d2aa = arpresolve+0x9a/frame 0xfffffe2021b704b0
ether_output() at 0xffffffff805755e2 = ether_output+0x2f2/frame 0xfffffe2021b70550
ip_output() at 0xffffffff805a4200 = ip_output+0x1390/frame 0xfffffe2021b706b0
tcp_output() at 0xffffffff806149d5 = tcp_output+0x17a5/frame 0xfffffe2021b70850
tcp_usr_disconnect() at 0xffffffff80620094 = tcp_usr_disconnect+0x74/frame 0xfffffe2021b70880
soclose() at 0xffffffff8051c238 = soclose+0x38/frame 0xfffffe2021b708b0
_fdrop() at 0xffffffff8045639a = _fdrop+0x1a/frame 0xfffffe2021b708d0
closef() at 0xffffffff80458a53 = closef+0x1e3/frame 0xfffffe2021b70960
closefp() at 0xffffffff804567ad = closefp+0x7d/frame 0xfffffe2021b709a0
amd64_syscall() at 0xffffffff806e4051 = amd64_syscall+0x2c1/frame 0xfffffe2021b70ab0
Xfast_syscall() at 0xffffffff806cb2bb = Xfast_syscall+0xfb/frame 0xfffffe2021b70ab0
--- syscall (6, FreeBSD ELF64, sys_close), rip = 0x8019dbeaa, rsp = 0x7fffffffe6a8, rbp = 0x7fffffffe6c0 ---

Tracing command nginx pid 1060 tid 101749 td 0xfffff80126a53a00
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe2021b7a240
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe2021b7a270
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe2021b7a2b0
__rw_wlock_hard() at 0xffffffff8049c314 = __rw_wlock_hard+0x94/frame 0xfffffe2021b7a340
in_lltable_lookup() at 0xffffffff80594823 = in_lltable_lookup+0x83/frame 0xfffffe2021b7a360
arpresolve() at 0xffffffff8058d2aa = arpresolve+0x9a/frame 0xfffffe2021b7a3c0
ether_output() at 0xffffffff805755e2 = ether_output+0x2f2/frame 0xfffffe2021b7a460
ip_output() at 0xffffffff805a4200 = ip_output+0x1390/frame 0xfffffe2021b7a5c0
tcp_output() at 0xffffffff806149d5 = tcp_output+0x17a5/frame 0xfffffe2021b7a760
tcp_usr_send() at 0xffffffff8062078b = tcp_usr_send+0x33b/frame 0xfffffe2021b7a7e0
sosend_generic() at 0xffffffff8051cf36 = sosend_generic+0x436/frame 0xfffffe2021b7a8a0
soo_write() at 0xffffffff804fd872 = soo_write+0x42/frame 0xfffffe2021b7a8d0
dofilewrite() at 0xffffffff804f5c97 = dofilewrite+0x87/frame 0xfffffe2021b7a920
kern_writev() at 0xffffffff804f5978 = kern_writev+0x68/frame 0xfffffe2021b7a970
sys_writev() at 0xffffffff804f5be6 = sys_writev+0x36/frame 0xfffffe2021b7a9a0
amd64_syscall() at 0xffffffff806e4051 = amd64_syscall+0x2c1/frame 0xfffffe2021b7aab0
Xfast_syscall() at 0xffffffff806cb2bb = Xfast_syscall+0xfb/frame 0xfffffe2021b7aab0
--- syscall (121, FreeBSD ELF64, sys_writev), rip = 0x8019cc6ba, rsp = 0x7fffffffd268, rbp = 0x7fffffffd2a0 ---

Tracing command nginx pid 1059 tid 101817 td 0xfffff8014de1da00
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe20220cc240
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe20220cc270
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe20220cc2b0
__rw_wlock_hard() at 0xffffffff8049c314 = __rw_wlock_hard+0x94/frame 0xfffffe20220cc340
in_lltable_lookup() at 0xffffffff80594823 = in_lltable_lookup+0x83/frame 0xfffffe20220cc360
arpresolve() at 0xffffffff8058d2aa = arpresolve+0x9a/frame 0xfffffe20220cc3c0
ether_output() at 0xffffffff805755e2 = ether_output+0x2f2/frame 0xfffffe20220cc460
ip_output() at 0xffffffff805a4200 = ip_output+0x1390/frame 0xfffffe20220cc5c0
tcp_output() at 0xffffffff806149d5 = tcp_output+0x17a5/frame 0xfffffe20220cc760
tcp_usr_send() at 0xffffffff8062078b = tcp_usr_send+0x33b/frame 0xfffffe20220cc7e0
sosend_generic() at 0xffffffff8051cf36 = sosend_generic+0x436/frame 0xfffffe20220cc8a0
soo_write() at 0xffffffff804fd872 = soo_write+0x42/frame 0xfffffe20220cc8d0
dofilewrite() at 0xffffffff804f5c97 = dofilewrite+0x87/frame 0xfffffe20220cc920
kern_writev() at 0xffffffff804f5978 = kern_writev+0x68/frame 0xfffffe20220cc970
sys_writev() at 0xffffffff804f5be6 = sys_writev+0x36/frame 0xfffffe20220cc9a0
amd64_syscall() at 0xffffffff806e4051 = amd64_syscall+0x2c1/frame 0xfffffe20220ccab0
Xfast_syscall() at 0xffffffff806cb2bb = Xfast_syscall+0xfb/frame 0xfffffe20220ccab0
--- syscall (121, FreeBSD ELF64, sys_writev), rip = 0x8019cc6ba, rsp = 0x7fffffffcea8, rbp = 0x7fffffffcee0 ---

Tracing command nginx pid 1058 tid 101816 td 0xfffff8014ddb2500
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe20220c7330
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe20220c7360
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe20220c73a0
__rw_wlock_hard() at 0xffffffff8049c314 = __rw_wlock_hard+0x94/frame 0xfffffe20220c7430
in_lltable_lookup() at 0xffffffff80594823 = in_lltable_lookup+0x83/frame 0xfffffe20220c7450
arpresolve() at 0xffffffff8058d2aa = arpresolve+0x9a/frame 0xfffffe20220c74b0
ether_output() at 0xffffffff805755e2 = ether_output+0x2f2/frame 0xfffffe20220c7550
ip_output() at 0xffffffff805a4200 = ip_output+0x1390/frame 0xfffffe20220c76b0
tcp_output() at 0xffffffff806149d5 = tcp_output+0x17a5/frame 0xfffffe20220c7850
tcp_usr_disconnect() at 0xffffffff80620094 = tcp_usr_disconnect+0x74/frame 0xfffffe20220c7880
soclose() at 0xffffffff8051c238 = soclose+0x38/frame 0xfffffe20220c78b0
_fdrop() at 0xffffffff8045639a = _fdrop+0x1a/frame 0xfffffe20220c78d0
closef() at 0xffffffff80458a53 = closef+0x1e3/frame 0xfffffe20220c7960
closefp() at 0xffffffff804567ad = closefp+0x7d/frame 0xfffffe20220c79a0
amd64_syscall() at 0xffffffff806e4051 = amd64_syscall+0x2c1/frame 0xfffffe20220c7ab0
Xfast_syscall() at 0xffffffff806cb2bb = Xfast_syscall+0xfb/frame 0xfffffe20220c7ab0
--- syscall (6, FreeBSD ELF64, sys_close), rip = 0x8019dbeaa, rsp = 0x7fffffffe6a8, rbp = 0x7fffffffe6c0 ---

Tracing command nginx pid 1057 tid 101815 td 0xfffff8014dde1500
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe20220c2240
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe20220c2270
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe20220c22b0
__rw_wlock_hard() at 0xffffffff8049c314 = __rw_wlock_hard+0x94/frame 0xfffffe20220c2340
in_lltable_lookup() at 0xffffffff80594823 = in_lltable_lookup+0x83/frame 0xfffffe20220c2360
arpresolve() at 0xffffffff8058d2aa = arpresolve+0x9a/frame 0xfffffe20220c23c0
ether_output() at 0xffffffff805755e2 = ether_output+0x2f2/frame 0xfffffe20220c2460
ip_output() at 0xffffffff805a4200 = ip_output+0x1390/frame 0xfffffe20220c25c0
tcp_output() at 0xffffffff806149d5 = tcp_output+0x17a5/frame 0xfffffe20220c2760
tcp_usr_send() at 0xffffffff8062078b = tcp_usr_send+0x33b/frame 0xfffffe20220c27e0
sosend_generic() at 0xffffffff8051cf36 = sosend_generic+0x436/frame 0xfffffe20220c28a0
soo_write() at 0xffffffff804fd872 = soo_write+0x42/frame 0xfffffe20220c28d0
dofilewrite() at 0xffffffff804f5c97 = dofilewrite+0x87/frame 0xfffffe20220c2920
kern_writev() at 0xffffffff804f5978 = kern_writev+0x68/frame 0xfffffe20220c2970
sys_writev() at 0xffffffff804f5be6 = sys_writev+0x36/frame 0xfffffe20220c29a0
amd64_syscall() at 0xffffffff806e4051 = amd64_syscall+0x2c1/frame 0xfffffe20220c2ab0
Xfast_syscall() at 0xffffffff806cb2bb = Xfast_syscall+0xfb/frame 0xfffffe20220c2ab0
--- syscall (121, FreeBSD ELF64, sys_writev), rip = 0x8019cc6ba, rsp = 0x7fffffffcc98, rbp = 0x7fffffffccd0 ---

Tracing command nginx pid 1055 tid 101800 td 0xfffff8014ddfda00
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe2022077240
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe2022077270
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe20220772b0
__rw_wlock_hard() at 0xffffffff8049c314 = __rw_wlock_hard+0x94/frame 0xfffffe2022077340
in_lltable_lookup() at 0xffffffff80594823 = in_lltable_lookup+0x83/frame 0xfffffe2022077360
arpresolve() at 0xffffffff8058d2aa = arpresolve+0x9a/frame 0xfffffe20220773c0
ether_output() at 0xffffffff805755e2 = ether_output+0x2f2/frame 0xfffffe2022077460
ip_output() at 0xffffffff805a4200 = ip_output+0x1390/frame 0xfffffe20220775c0
tcp_output() at 0xffffffff806149d5 = tcp_output+0x17a5/frame 0xfffffe2022077760
tcp_usr_send() at 0xffffffff8062078b = tcp_usr_send+0x33b/frame 0xfffffe20220777e0
sosend_generic() at 0xffffffff8051cf36 = sosend_generic+0x436/frame 0xfffffe20220778a0
soo_write() at 0xffffffff804fd872 = soo_write+0x42/frame 0xfffffe20220778d0
dofilewrite() at 0xffffffff804f5c97 = dofilewrite+0x87/frame 0xfffffe2022077920
kern_writev() at 0xffffffff804f5978 = kern_writev+0x68/frame 0xfffffe2022077970
sys_writev() at 0xffffffff804f5be6 = sys_writev+0x36/frame 0xfffffe20220779a0
amd64_syscall() at 0xffffffff806e4051 = amd64_syscall+0x2c1/frame 0xfffffe2022077ab0
Xfast_syscall() at 0xffffffff806cb2bb = Xfast_syscall+0xfb/frame 0xfffffe2022077ab0
--- syscall (121, FreeBSD ELF64, sys_writev), rip = 0x8019cc6ba, rsp = 0x7fffffffcc58, rbp = 0x7fffffffcc90 ---

Tracing command intr pid 12 tid 100015 td 0xfffff8011422b000
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe1f9e1cf7a0
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe1f9e1cf7d0
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe1f9e1cf810
__rw_wlock_hard() at 0xffffffff8049c314 = __rw_wlock_hard+0x94/frame 0xfffffe1f9e1cf8a0
in_losing() at 0xffffffff8059c8e8 = in_losing+0x98/frame 0xfffffe1f9e1cf8c0
tcp_timer_rexmt() at 0xffffffff8061e768 = tcp_timer_rexmt+0x428/frame 0xfffffe1f9e1cf8f0
softclock_call_cc() at 0xffffffff804b4ca9 = softclock_call_cc+0x179/frame 0xfffffe1f9e1cf9c0
softclock() at 0xffffffff804b5034 = softclock+0x44/frame 0xfffffe1f9e1cf9e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe1f9e1cfa20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe1f9e1cfa70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe1f9e1cfab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe1f9e1cfab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100016 td 0xfffff8011422aa00
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe1f9e1d47a0
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe1f9e1d47d0
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe1f9e1d4810
__rw_wlock_hard() at 0xffffffff8049c314 = __rw_wlock_hard+0x94/frame 0xfffffe1f9e1d48a0
in_losing() at 0xffffffff8059c8e8 = in_losing+0x98/frame 0xfffffe1f9e1d48c0
tcp_timer_rexmt() at 0xffffffff8061e768 = tcp_timer_rexmt+0x428/frame 0xfffffe1f9e1d48f0
softclock_call_cc() at 0xffffffff804b4ca9 = softclock_call_cc+0x179/frame 0xfffffe1f9e1d49c0
softclock() at 0xffffffff804b5034 = softclock+0x44/frame 0xfffffe1f9e1d49e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe1f9e1d4a20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe1f9e1d4a70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe1f9e1d4ab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe1f9e1d4ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100017 td 0xfffff8011422a500
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe1f9e1d9700
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe1f9e1d9730
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe1f9e1d9770
__rw_wlock_hard() at 0xffffffff8049c314 = __rw_wlock_hard+0x94/frame 0xfffffe1f9e1d9800
in_pcbfree() at 0xffffffff8059be10 = in_pcbfree+0x290/frame 0xfffffe1f9e1d9840
tcp_usr_detach() at 0xffffffff8061fff8 = tcp_usr_detach+0x78/frame 0xfffffe1f9e1d9860
sofree() at 0xffffffff8051c0a9 = sofree+0x109/frame 0xfffffe1f9e1d9890
tcp_close() at 0xffffffff80618150 = tcp_close+0xd0/frame 0xfffffe1f9e1d98c0
tcp_timer_2msl() at 0xffffffff8061dda3 = tcp_timer_2msl+0x1f3/frame 0xfffffe1f9e1d98f0
softclock_call_cc() at 0xffffffff804b4ca9 = softclock_call_cc+0x179/frame 0xfffffe1f9e1d99c0
softclock() at 0xffffffff804b5034 = softclock+0x44/frame 0xfffffe1f9e1d99e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe1f9e1d9a20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe1f9e1d9a70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe1f9e1d9ab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe1f9e1d9ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100018 td 0xfffff8011422a000
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe1f9e1de7c0
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe1f9e1de7f0
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe1f9e1de830
__rw_wlock_hard() at 0xffffffff8049c314 = __rw_wlock_hard+0x94/frame 0xfffffe1f9e1de8c0
tcp_timer_delack() at 0xffffffff8061db1c = tcp_timer_delack+0x2c/frame 0xfffffe1f9e1de8f0
softclock_call_cc() at 0xffffffff804b4ca9 = softclock_call_cc+0x179/frame 0xfffffe1f9e1de9c0
softclock() at 0xffffffff804b5034 = softclock+0x44/frame 0xfffffe1f9e1de9e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe1f9e1dea20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe1f9e1dea70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe1f9e1deab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe1f9e1deab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100019 td 0xfffff8011424da00
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe1f9e1e37a0
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe1f9e1e37d0
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe1f9e1e3810
__rw_wlock_hard() at 0xffffffff8049c314 = __rw_wlock_hard+0x94/frame 0xfffffe1f9e1e38a0
in_losing() at 0xffffffff8059c8e8 = in_losing+0x98/frame 0xfffffe1f9e1e38c0
tcp_timer_rexmt() at 0xffffffff8061e768 = tcp_timer_rexmt+0x428/frame 0xfffffe1f9e1e38f0
softclock_call_cc() at 0xffffffff804b4ca9 = softclock_call_cc+0x179/frame 0xfffffe1f9e1e39c0
softclock() at 0xffffffff804b5034 = softclock+0x44/frame 0xfffffe1f9e1e39e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe1f9e1e3a20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe1f9e1e3a70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe1f9e1e3ab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe1f9e1e3ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100020 td 0xfffff8011424d500
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe1f9e1e87a0
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe1f9e1e87d0
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe1f9e1e8810
__rw_wlock_hard() at 0xffffffff8049c314 = __rw_wlock_hard+0x94/frame 0xfffffe1f9e1e88a0
in_losing() at 0xffffffff8059c8e8 = in_losing+0x98/frame 0xfffffe1f9e1e88c0
tcp_timer_rexmt() at 0xffffffff8061e768 = tcp_timer_rexmt+0x428/frame 0xfffffe1f9e1e88f0
softclock_call_cc() at 0xffffffff804b4ca9 = softclock_call_cc+0x179/frame 0xfffffe1f9e1e89c0
softclock() at 0xffffffff804b5034 = softclock+0x44/frame 0xfffffe1f9e1e89e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe1f9e1e8a20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe1f9e1e8a70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe1f9e1e8ab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe1f9e1e8ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100021 td 0xfffff8011424d000
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe1f9e1ed3a0
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe1f9e1ed3d0
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe1f9e1ed410
__rw_wlock_hard() at 0xffffffff8049c314 = __rw_wlock_hard+0x94/frame 0xfffffe1f9e1ed4a0
in_lltable_lookup() at 0xffffffff80594823 = in_lltable_lookup+0x83/frame 0xfffffe1f9e1ed4c0
arpresolve() at 0xffffffff8058d2aa = arpresolve+0x9a/frame 0xfffffe1f9e1ed520
ether_output() at 0xffffffff805755e2 = ether_output+0x2f2/frame 0xfffffe1f9e1ed5c0
ip_output() at 0xffffffff805a4200 = ip_output+0x1390/frame 0xfffffe1f9e1ed720
tcp_output() at 0xffffffff806149d5 = tcp_output+0x17a5/frame 0xfffffe1f9e1ed8c0
tcp_timer_delack() at 0xffffffff8061db8f = tcp_timer_delack+0x9f/frame 0xfffffe1f9e1ed8f0
softclock_call_cc() at 0xffffffff804b4ca9 = softclock_call_cc+0x179/frame 0xfffffe1f9e1ed9c0
softclock() at 0xffffffff804b5034 = softclock+0x44/frame 0xfffffe1f9e1ed9e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe1f9e1eda20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe1f9e1eda70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe1f9e1edab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe1f9e1edab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100022 td 0xfffff8011424ca00
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe1f9e1f2700
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe1f9e1f2730
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe1f9e1f2770
__rw_wlock_hard() at 0xffffffff8049c314 = __rw_wlock_hard+0x94/frame 0xfffffe1f9e1f2800
in_pcbfree() at 0xffffffff8059be10 = in_pcbfree+0x290/frame 0xfffffe1f9e1f2840
tcp_usr_detach() at 0xffffffff8061fff8 = tcp_usr_detach+0x78/frame 0xfffffe1f9e1f2860
sofree() at 0xffffffff8051c0a9 = sofree+0x109/frame 0xfffffe1f9e1f2890
tcp_close() at 0xffffffff80618150 = tcp_close+0xd0/frame 0xfffffe1f9e1f28c0
tcp_timer_2msl() at 0xffffffff8061dda3 = tcp_timer_2msl+0x1f3/frame 0xfffffe1f9e1f28f0
softclock_call_cc() at 0xffffffff804b4ca9 = softclock_call_cc+0x179/frame 0xfffffe1f9e1f29c0
softclock() at 0xffffffff804b5034 = softclock+0x44/frame 0xfffffe1f9e1f29e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe1f9e1f2a20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe1f9e1f2a70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe1f9e1f2ab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe1f9e1f2ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100023 td 0xfffff8011424c500
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe1f9e1f7360
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe1f9e1f7390
critical_exit() at 0xffffffff804a6bee = critical_exit+0x7e/frame 0xfffffe1f9e1f73b0
ipi_bitmap_handler() at 0xffffffff80775629 = ipi_bitmap_handler+0x79/frame 0xfffffe1f9e1f73f0
Xipi_intr_bitmap_handler() at 0xffffffff806cc15e = Xipi_intr_bitmap_handler+0x8e/frame 0xfffffe1f9e1f73f0
--- interrupt, rip = 0xffffffff806e16c4, rsp = 0xfffffe1f9e1f74c0, rbp = 0xfffffe1f9e1f74c0 ---
bcopy() at 0xffffffff806e16c4 = bcopy+0x24/frame 0xfffffe1f9e1f74c0
arpresolve() at 0xffffffff8058d381 = arpresolve+0x171/frame 0xfffffe1f9e1f7520
ether_output() at 0xffffffff805755e2 = ether_output+0x2f2/frame 0xfffffe1f9e1f75c0
ip_output() at 0xffffffff805a4200 = ip_output+0x1390/frame 0xfffffe1f9e1f7720
tcp_output() at 0xffffffff806149d5 = tcp_output+0x17a5/frame 0xfffffe1f9e1f78c0
tcp_timer_rexmt() at 0xffffffff8061e7de = tcp_timer_rexmt+0x49e/frame 0xfffffe1f9e1f78f0
softclock_call_cc() at 0xffffffff804b4ca9 = softclock_call_cc+0x179/frame 0xfffffe1f9e1f79c0
softclock() at 0xffffffff804b5034 = softclock+0x44/frame 0xfffffe1f9e1f79e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe1f9e1f7a20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe1f9e1f7a70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe1f9e1f7ab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe1f9e1f7ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100024 td 0xfffff8011424c000
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe1f9e1fc3a0
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe1f9e1fc3d0
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe1f9e1fc410
__rw_wlock_hard() at 0xffffffff8049c314 = __rw_wlock_hard+0x94/frame 0xfffffe1f9e1fc4a0
in_lltable_lookup() at 0xffffffff80594823 = in_lltable_lookup+0x83/frame 0xfffffe1f9e1fc4c0
arpresolve() at 0xffffffff8058d2aa = arpresolve+0x9a/frame 0xfffffe1f9e1fc520
ether_output() at 0xffffffff805755e2 = ether_output+0x2f2/frame 0xfffffe1f9e1fc5c0
ip_output() at 0xffffffff805a4200 = ip_output+0x1390/frame 0xfffffe1f9e1fc720
tcp_output() at 0xffffffff806149d5 = tcp_output+0x17a5/frame 0xfffffe1f9e1fc8c0
tcp_timer_delack() at 0xffffffff8061db8f = tcp_timer_delack+0x9f/frame 0xfffffe1f9e1fc8f0
softclock_call_cc() at 0xffffffff804b4ca9 = softclock_call_cc+0x179/frame 0xfffffe1f9e1fc9c0
softclock() at 0xffffffff804b5034 = softclock+0x44/frame 0xfffffe1f9e1fc9e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe1f9e1fca20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe1f9e1fca70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe1f9e1fcab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe1f9e1fcab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100025 td 0xfffff8011424ba00
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe0000382700
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe0000382730
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe0000382770
__rw_wlock_hard() at 0xffffffff8049c314 = __rw_wlock_hard+0x94/frame 0xfffffe0000382800
in_pcbfree() at 0xffffffff8059be10 = in_pcbfree+0x290/frame 0xfffffe0000382840
tcp_usr_detach() at 0xffffffff8061fff8 = tcp_usr_detach+0x78/frame 0xfffffe0000382860
sofree() at 0xffffffff8051c0a9 = sofree+0x109/frame 0xfffffe0000382890
tcp_close() at 0xffffffff80618150 = tcp_close+0xd0/frame 0xfffffe00003828c0
tcp_timer_2msl() at 0xffffffff8061dda3 = tcp_timer_2msl+0x1f3/frame 0xfffffe00003828f0
softclock_call_cc() at 0xffffffff804b4ca9 = softclock_call_cc+0x179/frame 0xfffffe00003829c0
softclock() at 0xffffffff804b5034 = softclock+0x44/frame 0xfffffe00003829e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe0000382a20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe0000382a70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe0000382ab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe0000382ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100026 td 0xfffff8011424b500
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe0000387700
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe0000387730
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe0000387770
__rw_wlock_hard() at 0xffffffff8049c314 = __rw_wlock_hard+0x94/frame 0xfffffe0000387800
in_pcbfree() at 0xffffffff8059be10 = in_pcbfree+0x290/frame 0xfffffe0000387840
tcp_usr_detach() at 0xffffffff8061fff8 = tcp_usr_detach+0x78/frame 0xfffffe0000387860
sofree() at 0xffffffff8051c0a9 = sofree+0x109/frame 0xfffffe0000387890
tcp_close() at 0xffffffff80618150 = tcp_close+0xd0/frame 0xfffffe00003878c0
tcp_timer_2msl() at 0xffffffff8061dda3 = tcp_timer_2msl+0x1f3/frame 0xfffffe00003878f0
softclock_call_cc() at 0xffffffff804b4ca9 = softclock_call_cc+0x179/frame 0xfffffe00003879c0
softclock() at 0xffffffff804b5034 = softclock+0x44/frame 0xfffffe00003879e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe0000387a20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe0000387a70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe0000387ab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe0000387ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100103 td 0xfffff8012508ea00
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe2020ea82d0
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe2020ea8300
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe2020ea8340
__rw_wlock_hard() at 0xffffffff8049c314 = __rw_wlock_hard+0x94/frame 0xfffffe2020ea83d0
in_pcbfree() at 0xffffffff8059be10 = in_pcbfree+0x290/frame 0xfffffe2020ea8410
tcp_usr_detach() at 0xffffffff8061fff8 = tcp_usr_detach+0x78/frame 0xfffffe2020ea8430
sofree() at 0xffffffff8051c0a9 = sofree+0x109/frame 0xfffffe2020ea8460
tcp_close() at 0xffffffff80618150 = tcp_close+0xd0/frame 0xfffffe2020ea8490
tcp_do_segment() at 0xffffffff80610226 = tcp_do_segment+0x1666/frame 0xfffffe2020ea8590
tcp_input() at 0xffffffff8060e17c = tcp_input+0xe1c/frame 0xfffffe2020ea86e0
ip_input() at 0xffffffff805a087f = ip_input+0x15f/frame 0xfffffe2020ea8740
netisr_dispatch_src() at 0xffffffff80583db5 = netisr_dispatch_src+0xa5/frame 0xfffffe2020ea87a0
ether_demux() at 0xffffffff80575b3a = ether_demux+0x12a/frame 0xfffffe2020ea87d0
ether_nh_input() at 0xffffffff80576792 = ether_nh_input+0x322/frame 0xfffffe2020ea8830
netisr_dispatch_src() at 0xffffffff80583db5 = netisr_dispatch_src+0xa5/frame 0xfffffe2020ea8890
ether_input() at 0xffffffff80575db6 = ether_input+0x26/frame 0xfffffe2020ea88b0
ixgbe_rxeof() at 0xffffffff813df36b = ixgbe_rxeof+0x7ab/frame 0xfffffe2020ea8990
ixgbe_msix_que() at 0xffffffff813da57c = ixgbe_msix_que+0x8c/frame 0xfffffe2020ea89e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe2020ea8a20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe2020ea8a70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe2020ea8ab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe2020ea8ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100105 td 0xfffff8012508e000
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe2020eb2290
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe2020eb22c0
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe2020eb2300
__rw_wlock_hard() at 0xffffffff8049c314 = __rw_wlock_hard+0x94/frame 0xfffffe2020eb2390
in_pcbfree() at 0xffffffff8059be10 = in_pcbfree+0x290/frame 0xfffffe2020eb23d0
tcp_usr_detach() at 0xffffffff8061fff8 = tcp_usr_detach+0x78/frame 0xfffffe2020eb23f0
sofree() at 0xffffffff8051c0a9 = sofree+0x109/frame 0xfffffe2020eb2420
tcp_close() at 0xffffffff80618150 = tcp_close+0xd0/frame 0xfffffe2020eb2450
tcp_twstart() at 0xffffffff8061f0e7 = tcp_twstart+0x2b7/frame 0xfffffe2020eb2490
tcp_do_segment() at 0xffffffff80610bdf = tcp_do_segment+0x201f/frame 0xfffffe2020eb2590
tcp_input() at 0xffffffff8060e17c = tcp_input+0xe1c/frame 0xfffffe2020eb26e0
ip_input() at 0xffffffff805a087f = ip_input+0x15f/frame 0xfffffe2020eb2740
netisr_dispatch_src() at 0xffffffff80583db5 = netisr_dispatch_src+0xa5/frame 0xfffffe2020eb27a0
ether_demux() at 0xffffffff80575b3a = ether_demux+0x12a/frame 0xfffffe2020eb27d0
ether_nh_input() at 0xffffffff80576792 = ether_nh_input+0x322/frame 0xfffffe2020eb2830
netisr_dispatch_src() at 0xffffffff80583db5 = netisr_dispatch_src+0xa5/frame 0xfffffe2020eb2890
ether_input() at 0xffffffff80575db6 = ether_input+0x26/frame 0xfffffe2020eb28b0
ixgbe_rxeof() at 0xffffffff813df36b = ixgbe_rxeof+0x7ab/frame 0xfffffe2020eb2990
ixgbe_msix_que() at 0xffffffff813da57c = ixgbe_msix_que+0x8c/frame 0xfffffe2020eb29e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe2020eb2a20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe2020eb2a70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe2020eb2ab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe2020eb2ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100107 td 0xfffff8012508d500
cpustop_handler() at 0xffffffff80775998 = cpustop_handler+0x28/frame 0xfffffe1f9e16dcf0
ipi_nmi_handler() at 0xffffffff8077595a = ipi_nmi_handler+0x4a/frame 0xfffffe1f9e16dd10
trap() at 0xffffffff806e2e4a = trap+0x3a/frame 0xfffffe1f9e16df20
nmi_calltrap() at 0xffffffff806cb413 = nmi_calltrap+0x8/frame 0xfffffe1f9e16df20
--- trap 0x13, rip = 0xffffffff8049c71d, rsp = 0xfffffe2020ebc410, rbp = 0xfffffe2020ebc420 ---
__rw_try_wlock() at 0xffffffff8049c71d = __rw_try_wlock+0x4d/frame 0xfffffe2020ebc420
tcp_tw_2msl_scan() at 0xffffffff8061f1f6 = tcp_tw_2msl_scan+0xc6/frame 0xfffffe2020ebc450 /usr/src/sys/netinet/tcp_timewait.c:707
tcp_twstart() at 0xffffffff8061eed3 = tcp_twstart+0xa3/frame 0xfffffe2020ebc490
tcp_do_segment() at 0xffffffff80610bdf = tcp_do_segment+0x201f/frame 0xfffffe2020ebc590
tcp_input() at 0xffffffff8060e17c = tcp_input+0xe1c/frame 0xfffffe2020ebc6e0
ip_input() at 0xffffffff805a087f = ip_input+0x15f/frame 0xfffffe2020ebc740
netisr_dispatch_src() at 0xffffffff80583db5 = netisr_dispatch_src+0xa5/frame 0xfffffe2020ebc7a0
ether_demux() at 0xffffffff80575b3a = ether_demux+0x12a/frame 0xfffffe2020ebc7d0
ether_nh_input() at 0xffffffff80576792 = ether_nh_input+0x322/frame 0xfffffe2020ebc830
netisr_dispatch_src() at 0xffffffff80583db5 = netisr_dispatch_src+0xa5/frame 0xfffffe2020ebc890
ether_input() at 0xffffffff80575db6 = ether_input+0x26/frame 0xfffffe2020ebc8b0
ixgbe_rxeof() at 0xffffffff813df36b = ixgbe_rxeof+0x7ab/frame 0xfffffe2020ebc990
ixgbe_msix_que() at 0xffffffff813da57c = ixgbe_msix_que+0x8c/frame 0xfffffe2020ebc9e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe2020ebca20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe2020ebca70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe2020ebcab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe2020ebcab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Tracing command intr pid 12 tid 100111 td 0xfffff801250a2000
sched_switch() at 0xffffffff804c956d = sched_switch+0x6ad/frame 0xfffffe2020f302d0
mi_switch() at 0xffffffff804a8d92 = mi_switch+0xd2/frame 0xfffffe2020f30300
turnstile_wait() at 0xffffffff804ef177 = turnstile_wait+0x2a7/frame 0xfffffe2020f30340
__rw_wlock_hard() at 0xffffffff8049c314 = __rw_wlock_hard+0x94/frame 0xfffffe2020f303d0
in_pcbfree() at 0xffffffff8059be10 = in_pcbfree+0x290/frame 0xfffffe2020f30410
tcp_usr_detach() at 0xffffffff8061fff8 = tcp_usr_detach+0x78/frame 0xfffffe2020f30430
sofree() at 0xffffffff8051c0a9 = sofree+0x109/frame 0xfffffe2020f30460
tcp_close() at 0xffffffff80618150 = tcp_close+0xd0/frame 0xfffffe2020f30490
tcp_do_segment() at 0xffffffff8060f6cb = tcp_do_segment+0xb0b/frame 0xfffffe2020f30590
tcp_input() at 0xffffffff8060e17c = tcp_input+0xe1c/frame 0xfffffe2020f306e0
ip_input() at 0xffffffff805a087f = ip_input+0x15f/frame 0xfffffe2020f30740
netisr_dispatch_src() at 0xffffffff80583db5 = netisr_dispatch_src+0xa5/frame 0xfffffe2020f307a0
ether_demux() at 0xffffffff80575b3a = ether_demux+0x12a/frame 0xfffffe2020f307d0
ether_nh_input() at 0xffffffff80576792 = ether_nh_input+0x322/frame 0xfffffe2020f30830
netisr_dispatch_src() at 0xffffffff80583db5 = netisr_dispatch_src+0xa5/frame 0xfffffe2020f30890
ether_input() at 0xffffffff80575db6 = ether_input+0x26/frame 0xfffffe2020f308b0
ixgbe_rxeof() at 0xffffffff813df36b = ixgbe_rxeof+0x7ab/frame 0xfffffe2020f30990
ixgbe_msix_que() at 0xffffffff813da57c = ixgbe_msix_que+0x8c/frame 0xfffffe2020f309e0
intr_event_execute_handlers() at 0xffffffff8046c605 = intr_event_execute_handlers+0x95/frame 0xfffffe2020f30a20
ithread_loop() at 0xffffffff8046cc26 = ithread_loop+0xa6/frame 0xfffffe2020f30a70
fork_exit() at 0xffffffff8046a211 = fork_exit+0x71/frame 0xfffffe2020f30ab0
fork_trampoline() at 0xffffffff806cb50e = fork_trampoline+0xe/frame 0xfffffe2020f30ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

current version of /usr/src/sys/netinet/tcp_timewait.c

700 inp = tw->tw_inpcb;
701 in_pcbref(inp);
702 TW_RUNLOCK(V_tw_lock);
703
704 //retry:
705 if (INP_INFO_TRY_RLOCK(&V_tcbinfo)) {
706
707 if (INP_TRY_WLOCK(inp)) {
708 tw = intotw(inp);
709 if (in_pcbrele_wlocked(inp)) {
710 KASSERT(tw == NULL, ("%s: held last inp "
711 "reference but tw not NULL", __func__));
712 INP_INFO_RUNLOCK(&V_tcbinfo);
713 continue;
714 }
715
716 if (tw == NULL) {
717 /* tcp_twclose() has already been called */
718 INP_WUNLOCK(inp);
719 INP_INFO_RUNLOCK(&V_tcbinfo);
720 continue;
721 }
722
723 tcp_twclose(tw, reuse);
724 INP_INFO_RUNLOCK(&V_tcbinfo);
725 if (reuse)
726 return tw;
727 } else {
728 INP_INFO_RUNLOCK(&V_tcbinfo);
729 goto retry;
730 }
731 } else {
732 /* INP_INFO lock is busy, continue later. */
733 if (INP_TRY_WLOCK(inp)) {
734 if (!in_pcbrele_wlocked(inp))
735 INP_WUNLOCK(inp);
736 break;
737 } else {
738 goto retry;
739 }
740 }
741 }
742 retry:
743 return NULL;
744 }

Torfinn Ingolfsen

unread,
Sep 30, 2016, 8:42:32 AM9/30/16
to freebsd...@freebsd.org
On Wed, 28 Sep 2016 12:06:47 +0200
Julien Charbon <julien....@gmail.com> wrote:

>
> I am still trying to reproduce your issue, without success so far.

All: please remember to trim your quotes. Thank you.
Carry on.
--
Regards,
Torfinn Ingolfsen <torfinn....@getmail.no>

hiren panchasara

unread,
Oct 6, 2016, 3:45:30 AM10/6/16
to Julien Charbon, Konstantin Belousov, Julien Charbon, freebsd...@freebsd.org, Slawa Olhovchenkov
On 10/06/16 at 09:28P, Julien Charbon wrote:
>
> Hi,
>
> On 9/28/16 1:59 PM, Slawa Olhovchenkov wrote:
> > On Wed, Sep 28, 2016 at 12:06:47PM +0200, Julien Charbon wrote:
> >> <SNIP>
> >> I am still trying to reproduce your issue, without success so far.
>
> Thanks for Slawa effort and multiple debug report we start seeing the
> bottom of this issue and it seems to be a generic one. The most useful
> report being:
>
> panic: tcp_detach: INP_TIMEWAIT && INP_DROPPED && tp != NULL

I know there are multiple and probably related problems being
discussed here but what about the one mentioned in subject of this
thread?
Apologies if I've missed something conclusive in one of the replies of
this thread about that issue.

Cheers,
Hiren

hiren panchasara

unread,
Oct 6, 2016, 4:07:18 AM10/6/16
to Julien Charbon, Konstantin Belousov, Julien Charbon, freebsd...@freebsd.org, Slawa Olhovchenkov
On 10/06/16 at 09:51P, Julien Charbon wrote:
>
> Hi Hiren,
> This issue can lead the machine being stuck on high network load, by
> double freeing an inp, you can corrupt/leak an inp lock, and the network
> stack can wait definitely on this inp lock to be released. You get this
> assert only with INVARIANTS defined.
>
> Of usual, we can have more than one issue here, but this
> INP_TIMEWAI|INP_DROPPED issue need to be fixed anyway.

Thanks for the explanation, Julien.

Cheers,
Hiren

Julien Charbon

unread,
Oct 6, 2016, 5:00:11 AM10/6/16
to hiren panchasara, Konstantin Belousov, Julien Charbon, freebsd...@freebsd.org, Slawa Olhovchenkov

Hi Hiren,

On 10/6/16 9:44 AM, hiren panchasara wrote:
This issue can lead the machine being stuck on high network load, by
double freeing an inp, you can corrupt/leak an inp lock, and the network
stack can wait definitely on this inp lock to be released. You get this
assert only with INVARIANTS defined.

Of usual, we can have more than one issue here, but this
INP_TIMEWAI|INP_DROPPED issue need to be fixed anyway.

--
Julien

signature.asc

Slawa Olhovchenkov

unread,
Oct 6, 2016, 7:11:11 AM10/6/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara, Julien Charbon
On Thu, Oct 06, 2016 at 09:28:06AM +0200, Julien Charbon wrote:

> 2. thread1: In tcp_close() the inp is marked with INP_DROPPED flag, the
> process continues and calls INP_WUNLOCK() here:
>
> https://github.com/freebsd/freebsd/blob/releng/11.0/sys/netinet/tcp_subr.c#L1568

Look also to sys/netinet/tcp_timewait.c:488

And check other locks from r160549

Julien Charbon

unread,
Oct 6, 2016, 7:15:45 AM10/6/16
to Slawa Olhovchenkov, Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara

Hi,

On 9/28/16 1:59 PM, Slawa Olhovchenkov wrote:
> On Wed, Sep 28, 2016 at 12:06:47PM +0200, Julien Charbon wrote:
>> <SNIP>
>> I am still trying to reproduce your issue, without success so far.

Thanks for Slawa effort and multiple debug report we start seeing the
bottom of this issue and it seems to be a generic one. The most useful
report being:

panic: tcp_detach: INP_TIMEWAIT && INP_DROPPED && tp != NULL
cpuid = 4
KDB: stack backtrace:
db_trace_self_wrapper() at 0xffffffff8032467b =
db_trace_self_wrapper+0x2b/frame 0xfffffe1f9e1f8730
vpanic() at 0xffffffff804b5672 = vpanic+0x182/frame 0xfffffe1f9e1f87b0
kassert_panic() at 0xffffffff804b54e6 = kassert_panic+0x126/frame
0xfffffe1f9e1f8820
tcp_usr_detach() at 0xffffffff806564dc = tcp_usr_detach+0x1bc/frame
0xfffffe1f9e1f8850
sofree() at 0xffffffff8053de66 = sofree+0x1a6/frame 0xfffffe1f9e1f8880
tcp_close() at 0xffffffff8064dd8e = tcp_close+0x11e/frame 0xfffffe1f9e1f88b0
tcp_timer_2msl() at 0xffffffff80653c28 = tcp_timer_2msl+0x278/frame
0xfffffe1f9e1f88e0
softclock_call_cc() at 0xffffffff804cbacc =
softclock_call_cc+0x19c/frame 0xfffffe1f9e1f89c0
softclock() at 0xffffffff804cbec7 = softclock+0x47/frame 0xfffffe1f9e1f89e0
intr_event_execute_handlers() at 0xffffffff8047aa86 =
intr_event_execute_handlers+0x96/frame 0xfffffe1f9e1f8a20
ithread_loop() at 0xffffffff8047b106 = ithread_loop+0xa6/frame
0xfffffe1f9e1f8a70
fork_exit() at 0xffffffff804781b4 = fork_exit+0x84/frame 0xfffffe1f9e1f8ab0
fork_trampoline() at 0xffffffff80713fce = fork_trampoline+0xe/frame
0xfffffe1f9e1f8ab0

The scenario:

1. thread1: tcp_timer_2msl() expires and tcp_close() is called to clean
this TCP connection.

2. thread1: In tcp_close() the inp is marked with INP_DROPPED flag, the
process continues and calls INP_WUNLOCK() here:

https://github.com/freebsd/freebsd/blob/releng/11.0/sys/netinet/tcp_subr.c#L1568

3. thread2: Now because INP_WLOCK is released, the inp can transition
to INP_TIMEWAIT state and nothing is preventing it.

4. thread2: During the INP_TIMEWAIT state transition, the inp is marked
with INP_TIMEWAIT flag.

5. thread1: Back in business and tcp_close() call continues with
sofree() -> tcp_usr_detach() -> tcp_detach(). Then as inp is marked
with INP_DROPPED|INP_TIMEWAIT flags, in_pcbfree() is called. w/
INVARIANTS you have an assertion here, w/o INVARIANTS process continues.

6. Later: tcp_twclose() cleans up this INP_TIMEWAIT inp and calls
in_pcbfree() again to achieve a fancy inp double-free.

This issue is a tricky one and seems here since quite a while. It has
been witness at least once in 10.1 and by two different people in 11.0.

Astute questions:

o Why INP_DROPPED flag is not tested in tcp_input() in the first place?
When you are marked as INP_DROPPED, you are almost dead, you should not
be allowed to transition to a different state!

Good point, and tcp_input() relies on the fact that INP_DROPPED inps
are no more in TCP hash table. But tcp_input() in some cases do relock
INP (see relocked: label) and if it does check a lot of things after
having relocked the inp it does not check for a recently added
INP_DROPPED flag.

o Why tcp_detach() does an unconditional in_pcbfree() for inps in
TIMEWAIT state? This because inps in TIMEWAIT state have only one exit:
Being freed. And it is the duty of tcp_detach() to free all inps with
INP_DROPPED|INP_TIMEWAIT.

o Why this issue is so rare?

Good question, I can see how to have a specific TCP traffic to make it
more frequent but no definitive answer yet.

Fix proposal:

This issue description is still a bit fresh but I would enforce that an
inp with INP_DROPPED flag should not be allowed to change state.

Thing learned:

When re-locking an inp, it might have changed a lot, and you might not
like what it became.

Thanks again to Slawa, for his numerous debug reports and always
questioning my explanations. His last question directly led to this
finding. He is testing a quick workaround patch to check if there is more.

I will create a review with a fix proposal, and don't hesitate if you
have other comments on this issue.

--
Julien

signature.asc

Slawa Olhovchenkov

unread,
Oct 7, 2016, 7:56:01 AM10/7/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara, Julien Charbon
On Thu, Oct 06, 2016 at 09:28:06AM +0200, Julien Charbon wrote:

> Thanks again to Slawa, for his numerous debug reports and always
> questioning my explanations. His last question directly led to this
> finding. He is testing a quick workaround patch to check if there is more.

Thanks very match! You was very helpful, explaining detail of FreeBSD
TCP code and gave a lot of work to this issuse, I'm appreciate all
your help!
Thanks again!

Julien Charbon

unread,
Oct 10, 2016, 7:27:25 AM10/10/16
to Slawa Olhovchenkov, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara

Hi,

On 10/6/16 1:10 PM, Slawa Olhovchenkov wrote:
> On Thu, Oct 06, 2016 at 09:28:06AM +0200, Julien Charbon wrote:
>
>> 2. thread1: In tcp_close() the inp is marked with INP_DROPPED flag, the
>> process continues and calls INP_WUNLOCK() here:
>>
>> https://github.com/freebsd/freebsd/blob/releng/11.0/sys/netinet/tcp_subr.c#L1568
>
> Look also to sys/netinet/tcp_timewait.c:488
>
> And check other locks from r160549

You are right, and here the a fix proposal for this issue:

Fix a double-free when an inp transitions to INP_TIMEWAIT state after
having been dropped
https://reviews.freebsd.org/D8211

It basically enforces in_pcbdrop() logic in tcp_input(): A INP_DROPPED
inpcb should never be proceed further.

Slawa, as you are the only one to reproduce this issue currently, could
test this patch? (And remove the temporary patch I did provided to you
before).

I will wait for your tests results before pushing further.

Thanks!

diff --git a/sys/netinet/tcp_input.c b/sys/netinet/tcp_input.c
index c72f01f..37f27e0 100644
--- a/sys/netinet/tcp_input.c
+++ b/sys/netinet/tcp_input.c
@@ -921,6 +921,16 @@ findpcb:
goto dropwithreset;
}
INP_WLOCK_ASSERT(inp);
+ /*
+ * While waiting for inp lock during the lookup, another thread
+ * can have droppedt the inpcb, in which case we need to loop back
+ * and try to find a new inpcb to deliver to.
+ */
+ if (inp->inp_flags & INP_DROPPED) {
+ INP_WUNLOCK(inp);
+ inp = NULL;
+ goto findpcb;
+ }
if ((inp->inp_flowtype == M_HASHTYPE_NONE) &&
(M_HASHTYPE_GET(m) != M_HASHTYPE_NONE) &&
((inp->inp_socket == NULL) ||
@@ -981,6 +991,10 @@ relocked:
if (in_pcbrele_wlocked(inp)) {
inp = NULL;
goto findpcb;
+ } else if (inp->inp_flags & INP_DROPPED) {
+ INP_WUNLOCK(inp);
+ inp = NULL;
+ goto findpcb;
}
} else
ti_locked = TI_RLOCKED;
@@ -1040,6 +1054,10 @@ relocked:
if (in_pcbrele_wlocked(inp)) {
inp = NULL;
goto findpcb;
+ } else if (inp->inp_flags & INP_DROPPED) {
+ INP_WUNLOCK(inp);
+ inp = NULL;
+ goto findpcb;
}
goto relocked;
} else

--
Julien

signature.asc

Slawa Olhovchenkov

unread,
Oct 10, 2016, 7:31:13 AM10/10/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
On Mon, Oct 10, 2016 at 01:26:12PM +0200, Julien Charbon wrote:

>
> Hi,
>
> On 10/6/16 1:10 PM, Slawa Olhovchenkov wrote:
> > On Thu, Oct 06, 2016 at 09:28:06AM +0200, Julien Charbon wrote:
> >
> >> 2. thread1: In tcp_close() the inp is marked with INP_DROPPED flag, the
> >> process continues and calls INP_WUNLOCK() here:
> >>
> >> https://github.com/freebsd/freebsd/blob/releng/11.0/sys/netinet/tcp_subr.c#L1568
> >
> > Look also to sys/netinet/tcp_timewait.c:488
> >
> > And check other locks from r160549
>
> You are right, and here the a fix proposal for this issue:
>
> Fix a double-free when an inp transitions to INP_TIMEWAIT state after
> having been dropped
> https://reviews.freebsd.org/D8211
>
> It basically enforces in_pcbdrop() logic in tcp_input(): A INP_DROPPED
> inpcb should never be proceed further.
>
> Slawa, as you are the only one to reproduce this issue currently, could
> test this patch? (And remove the temporary patch I did provided to you
> before).
>
> I will wait for your tests results before pushing further.

OK, I am will try it tomorrow
Thanks!

Slawa Olhovchenkov

unread,
Oct 10, 2016, 9:32:50 AM10/10/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
On Mon, Oct 10, 2016 at 01:26:12PM +0200, Julien Charbon wrote:

>
Are you sure about this goto?
Can this cause infinite loop by found same inpcb?
May be drop packet is more correct?

Julien Charbon

unread,
Oct 10, 2016, 10:04:31 AM10/10/16
to Slawa Olhovchenkov, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara

Hi Slawa,
Good question: Infinite loop is not possible here, as the next TCP
hash lookup will return NULL or a fresh new and not dropped inp. You
can check the current other usages of goto findpcb in tcp_input(). The
rational here being:

- Behavior before the patch: If the inp we found was deleted then goto
findpcb.
- Behavior after the patch: If the inp we found was deleted or dropped
then goto findpcb.

I just prefer having the same behavior applied everywhere: If
tcp_input() loses the inp lock race and the inp was deleted or dropped
then retry to find a new inpcb to deliver to.

But you are right dropping the packet here will also fix the issue.

Then the review process becomes quite helpful because people can argue:
Dropping here is better because "blah", or goto findpcb is better
because "bluh", etc. And at the review end you have a nice final patch.

https://reviews.freebsd.org/D8211

--
Julien

signature.asc

Slawa Olhovchenkov

unread,
Oct 10, 2016, 10:30:09 AM10/10/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
I am not expert in this api and don't see cause of this: I am assume
hash lookup don't remove from hash returned args and I am don't see
any removing of this inp. Why hash lookup don't return same inp?

(assume this input patch interrupt callout code on the same CPU core).

> can check the current other usages of goto findpcb in tcp_input(). The
> rational here being:
>
> - Behavior before the patch: If the inp we found was deleted then goto
> findpcb.
> - Behavior after the patch: If the inp we found was deleted or dropped
> then goto findpcb.
>
> I just prefer having the same behavior applied everywhere: If
> tcp_input() loses the inp lock race and the inp was deleted or dropped
> then retry to find a new inpcb to deliver to.
>
> But you are right dropping the packet here will also fix the issue.
>
> Then the review process becomes quite helpful because people can argue:
> Dropping here is better because "blah", or goto findpcb is better
> because "bluh", etc. And at the review end you have a nice final patch.
>
> https://reviews.freebsd.org/D8211

I am not sure, I am see to

sys/netinet/in_pcb.h:#define INP_DROPPED 0x04000000 /* protocol drop flag */

and think this is a flag 'all packets must be droped'

Julien Charbon

unread,
Oct 10, 2016, 11:59:40 AM10/10/16
to Slawa Olhovchenkov, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara

Hi,
Hm, I believe this flag means "this inp has been dropped by the TCP
stack, so don't use it anymore". Actually this flag is better described
in the function that sets it:

"(INP_DROPPED) is used by TCP to mark an inpcb as unused and avoid
future packet delivery or event notification when a socket remains open
but TCP has closed."

https://github.com/freebsd/freebsd/blob/release/11.0.0/sys/netinet/in_pcb.c#L1320

/*
* in_pcbdrop() removes an inpcb from hashed lists, releasing its
address and
* port reservation, and preventing it from being returned by inpcb lookups.
*
* It is used by TCP to mark an inpcb as unused and avoid future packet
* delivery or event notification when a socket remains open but TCP has
* closed. This might occur as a result of a shutdown()-initiated TCP close
* or a RST on the wire, and allows the port binding to be reused while
still
* maintaining the invariant that so_pcb always points to a valid inpcb
until
* in_pcbdetach().
*
*/
void
in_pcbdrop(struct inpcb *inp)
{
inp->inp_flags |= INP_DROPPED;
...

The classical example where "goto findpcb" is useful: You receive a
new connection request with a TCP SYN packet and this packet is unlucky
and reached a inp being dropped:

- with "goto findpcb" approach, the next lookup will most likely find
the LISTEN inp and start the TCP hand-shake as usual
- with "drop the packet" approach, the TCP client will need to
re-transmit a TCP SYN packet

It is not because a packet was unlucky once that it deserves to be
dropped. :)

--
Julien


signature.asc

Slawa Olhovchenkov

unread,
Oct 10, 2016, 1:36:02 PM10/10/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
Thanks for explaining, very helpfull.
In this situation (TCP SYN with same 4-tuple as existing socket)
allocate new PCB is best. But for this we must destroy current PCB. I
am think INP_WUNLOCK(inp) don't destroy it and in_pcblookup_mbuf find
it again (I am think in_pcblookup_mbuf find this PCB on first turn).
I am assume for classical example in_pcbrele_wlocked(inp) free and
destroy current PCB for possibility in_pcblookup_mbuf allocate new
one.

Julien Charbon

unread,
Oct 11, 2016, 3:21:05 AM10/11/16
to Slawa Olhovchenkov, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara

Hi Slawa,
Astute question: Here, the same inp cannot be find again by
in_pcblookup_mbuf(), the explanation is a bit long though:

in_pcbdrop() does two things under INP_WLOCK lock:
https://github.com/freebsd/freebsd/blob/release/11.0.0/sys/netinet/in_pcb.c#L1334

1. Add INP_DROPPED flag
2. Remove the inp from the TCP hash table

And once removed from the TCP hash table, in_pcblookup_mbuf() will
return NULL when doing the same TCP hash table lookup again.

It means that under a INP_WLOCK lock, these two changes are atomic:

- Either an inp does not have INP_DROPPED flag and can be in TCP hash table
- Either an inp has INP_DROPPED flag and is not in TCP hash table

But when you don't have the INP_WLOCK lock, then you can witness the
intermediate state where a inp is still in TCP hash table while a thread
is adding the INP_DROPPED flag.

Nothing unusual here.

Then threads are competing for the INP_WLOCK lock. For the example,
let's say the thread A wants to run tcp_input()/in_pcblookup_mbuf() and
racing for this INP_WLOCK:

https://github.com/freebsd/freebsd/blob/release/11.0.0/sys/netinet/in_pcb.c#L1964

And thread B wants to run tcp_timer_2msl()/tcp_close()/in_pcbdrop() and
racing for this INP_WLOCK:

https://github.com/freebsd/freebsd/blob/release/11.0.0/sys/netinet/tcp_timer.c#L323

That leads to two cases:

o Thread A wins the race:

Thread A will continue tcp_input() as usal and INP_DROPPED flags is
not set and inp is still in TCP hash table.
Thread B is waiting on thread A to release INP_WLOCK after finishing
tcp_input() processing, and thread B will continue
tcp_timer_2msl()/tcp_close()/in_pcbdrop() processing.

o Thread B wins the race:

Thread B runs tcp_timer_2msl()/tcp_close()/in_pcbdrop() and inp
INP_DROPPED is set and inp being removed from TCP hash table.
In parallel, thread A has found the inp in TCP hash before is was
removed, and waiting on the found inp INP_WLOCK lock.
Once thread B has released the INP_WLOCK lock, thread A gets this lock
and sees the INP_DROPPED flag and do "goto findpcb" but here because the
inp is not more in TCP hash table and it will not be find again by
in_pcblookup_mbuf().

Hopefully I am clear enough here.

--
Julien

signature.asc

Slawa Olhovchenkov

unread,
Oct 11, 2016, 8:12:22 AM10/11/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
Thanks, very clear.
Small qeustion: when both thread run on same CPU core, INP_WLOCK will
be re-schedule?

As I remeber race created by call tcp_twstart() at time of end
tcp_close(), at path sofree()-tcp_usr_detach() and unexpected
INP_TIMEWAIT state in the tcp_usr_detach(). INP_TIMEWAIT set in tcp_twstart()

After check source code I am found invocation of tcp_twstart() in
sys/netinet/tcp_stacks/fastpath.c, sys/netinet/tcp_input.c,
sys/dev/cxgb/ulp/tom/cxgb_cpl_io.c, sys/dev/cxgbe/tom/t4_cpl_io.c.

Invocation from sys/netinet/tcp_stacks/fastpath.c and
sys/netinet/tcp_input.c guarded by INP_WLOCK in tcp_input(), and now
will be OK.

Invocation from sys/dev/cxgb/ulp/tom/cxgb_cpl_io.c and
sys/dev/cxgbe/tom/t4_cpl_io.c is not clear to me, I am see independed
INP_WLOCK. Is this OK?

Can be thread A wants do_peer_close() directed from chelsio IRQ
handler, bypass tcp_input()?

Julien Charbon

unread,
Oct 12, 2016, 4:19:12 AM10/12/16
to Slawa Olhovchenkov, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara

Hi Slawa,
Hmm, a thread can re-scheduled but not a lock. Thus no sure I
understand your question here. :)

> As I remeber race created by call tcp_twstart() at time of end
> tcp_close(), at path sofree()-tcp_usr_detach() and unexpected
> INP_TIMEWAIT state in the tcp_usr_detach(). INP_TIMEWAIT set in tcp_twstart()

Exactly, thus the current fix is: If you already have the INP_DROPPED
flag set you are not allowed to call tcp_twstart(), actually it is a
good candidate for a new INVARIANT. Let me add that.

> After check source code I am found invocation of tcp_twstart() in
> sys/netinet/tcp_stacks/fastpath.c, sys/netinet/tcp_input.c,
> sys/dev/cxgb/ulp/tom/cxgb_cpl_io.c, sys/dev/cxgbe/tom/t4_cpl_io.c.
>
> Invocation from sys/netinet/tcp_stacks/fastpath.c and
> sys/netinet/tcp_input.c guarded by INP_WLOCK in tcp_input(), and now
> will be OK.
>
> Invocation from sys/dev/cxgb/ulp/tom/cxgb_cpl_io.c and
> sys/dev/cxgbe/tom/t4_cpl_io.c is not clear to me, I am see independed
> INP_WLOCK. Is this OK?
>
> Can be thread A wants do_peer_close() directed from chelsio IRQ
> handler, bypass tcp_input()?

If you look carefully INP_WLOCK is used in cxgb_cpl_io.c and
t4_cpl_io.c before calling tcp_twstart().

--
Julien

signature.asc

Slawa Olhovchenkov

unread,
Oct 12, 2016, 4:41:16 AM10/12/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
I am don't know how work INP_WLOCK in this case (all on same cpu):

thread1: INP_WLOCK
-interrupt--
thread2: INP_WLOCK

if INP_WLOCK is like spinlock -- this is dead lock.
if INP_WLOCK is like mutex -- thread1 resheduled.

> > As I remeber race created by call tcp_twstart() at time of end
> > tcp_close(), at path sofree()-tcp_usr_detach() and unexpected
> > INP_TIMEWAIT state in the tcp_usr_detach(). INP_TIMEWAIT set in tcp_twstart()
>
> Exactly, thus the current fix is: If you already have the INP_DROPPED
> flag set you are not allowed to call tcp_twstart(), actually it is a
> good candidate for a new INVARIANT. Let me add that.
>
> > After check source code I am found invocation of tcp_twstart() in
> > sys/netinet/tcp_stacks/fastpath.c, sys/netinet/tcp_input.c,
> > sys/dev/cxgb/ulp/tom/cxgb_cpl_io.c, sys/dev/cxgbe/tom/t4_cpl_io.c.
> >
> > Invocation from sys/netinet/tcp_stacks/fastpath.c and
> > sys/netinet/tcp_input.c guarded by INP_WLOCK in tcp_input(), and now
> > will be OK.
> >
> > Invocation from sys/dev/cxgb/ulp/tom/cxgb_cpl_io.c and
> > sys/dev/cxgbe/tom/t4_cpl_io.c is not clear to me, I am see independed
> > INP_WLOCK. Is this OK?
> >
> > Can be thread A wants do_peer_close() directed from chelsio IRQ
> > handler, bypass tcp_input()?
>
> If you look carefully INP_WLOCK is used in cxgb_cpl_io.c and
> t4_cpl_io.c before calling tcp_twstart().

Yes, and you remeber: sys/netinet/tcp_subr.c

1535 struct tcpcb *
1536 tcp_close(struct tcpcb *tp)
1537 {
..
1569 INP_WUNLOCK(inp);
1570 ACCEPT_LOCK();
1571 SOCK_LOCK(so);
1572 so->so_state &= ~SS_PROTOREF;
1573 sofree(so);
1574 return (NULL);

sofree() call tcp_usr_detach() and in tcp_usr_detach() we have
unexpected INP_TIMEWAIT.

Slawa Olhovchenkov

unread,
Oct 12, 2016, 5:30:13 AM10/12/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
On Wed, Oct 12, 2016 at 11:19:48AM +0200, Julien Charbon wrote:

> > if INP_WLOCK is like spinlock -- this is dead lock.
> > if INP_WLOCK is like mutex -- thread1 resheduled.
>
> Thanks, I understand you question now. No an interrupt cannot bypass a
> lock: Here INP_WLOCK is like mutex -- thread1 resheduled.

Thanks, nice.

> >>> As I remeber race created by call tcp_twstart() at time of end
> >>> tcp_close(), at path sofree()-tcp_usr_detach() and unexpected
> >>> INP_TIMEWAIT state in the tcp_usr_detach(). INP_TIMEWAIT set in tcp_twstart()
> >>
> >> Exactly, thus the current fix is: If you already have the INP_DROPPED
> >> flag set you are not allowed to call tcp_twstart(), actually it is a
> >> good candidate for a new INVARIANT. Let me add that.
> >>
> >>> After check source code I am found invocation of tcp_twstart() in
> >>> sys/netinet/tcp_stacks/fastpath.c, sys/netinet/tcp_input.c,
> >>> sys/dev/cxgb/ulp/tom/cxgb_cpl_io.c, sys/dev/cxgbe/tom/t4_cpl_io.c.
> >>>
> >>> Invocation from sys/netinet/tcp_stacks/fastpath.c and
> >>> sys/netinet/tcp_input.c guarded by INP_WLOCK in tcp_input(), and now
> >>> will be OK.
> >>>
> >>> Invocation from sys/dev/cxgb/ulp/tom/cxgb_cpl_io.c and
> >>> sys/dev/cxgbe/tom/t4_cpl_io.c is not clear to me, I am see independed
> >>> INP_WLOCK. Is this OK?
> >>>
> >>> Can be thread A wants do_peer_close() directed from chelsio IRQ
> >>> handler, bypass tcp_input()?
> >>
> >> If you look carefully INP_WLOCK is used in cxgb_cpl_io.c and
> >> t4_cpl_io.c before calling tcp_twstart().
> >
> > Yes, and you remeber: sys/netinet/tcp_subr.c
> >
> > 1535 struct tcpcb *
> > 1536 tcp_close(struct tcpcb *tp)
> > 1537 {
> > ...
> > 1569 INP_WUNLOCK(inp);
> > 1570 ACCEPT_LOCK();
> > 1571 SOCK_LOCK(so);
> > 1572 so->so_state &= ~SS_PROTOREF;
> > 1573 sofree(so);
> > 1574 return (NULL);
> >
> > sofree() call tcp_usr_detach() and in tcp_usr_detach() we have
> > unexpected INP_TIMEWAIT.
>
> I see, thus just for the context: The TCP stack in sys/dev/cxgb* is a
> TOE (TCP Offload Engine?) TCP stack for Chelsio NICs, it is a
> separate/side TCP stack that is used only with TCP_OFFLOAD option.
>
> This TOE TCP stack actually has its own set of detach()/input()
> functions and seems to check INP_DROPPED flag properly. I guess @np
> check fixes in socket TCP stack and decides which one can also impact
> the Chelsio TOE TCP stack. Some bugs are only in socket TCP stack, some
> are only in TOE TCP stack.

I am fear about other direction -- setting INP_TIMEWAIT in Chelsio TOE
TCP stack and impact this to
tcp_timer_2msl()/tcp_close()/sofree()/tcp_usr_detach() path.

Julien Charbon

unread,
Oct 12, 2016, 5:43:26 AM10/12/16
to Slawa Olhovchenkov, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
I see, I expect no problem on this side as tcp_timer_2msl() checks the
INP_TIMEWAIT flag and do not call tcp_close() if set.

--
Julien

signature.asc

Julien Charbon

unread,
Oct 12, 2016, 5:49:54 AM10/12/16
to Slawa Olhovchenkov, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara

Hi Slawa,
Thanks, I understand you question now. No an interrupt cannot bypass a
lock: Here INP_WLOCK is like mutex -- thread1 resheduled.

>>> As I remeber race created by call tcp_twstart() at time of end
>>> tcp_close(), at path sofree()-tcp_usr_detach() and unexpected
>>> INP_TIMEWAIT state in the tcp_usr_detach(). INP_TIMEWAIT set in tcp_twstart()
>>
>> Exactly, thus the current fix is: If you already have the INP_DROPPED
>> flag set you are not allowed to call tcp_twstart(), actually it is a
>> good candidate for a new INVARIANT. Let me add that.
>>
>>> After check source code I am found invocation of tcp_twstart() in
>>> sys/netinet/tcp_stacks/fastpath.c, sys/netinet/tcp_input.c,
>>> sys/dev/cxgb/ulp/tom/cxgb_cpl_io.c, sys/dev/cxgbe/tom/t4_cpl_io.c.
>>>
>>> Invocation from sys/netinet/tcp_stacks/fastpath.c and
>>> sys/netinet/tcp_input.c guarded by INP_WLOCK in tcp_input(), and now
>>> will be OK.
>>>
>>> Invocation from sys/dev/cxgb/ulp/tom/cxgb_cpl_io.c and
>>> sys/dev/cxgbe/tom/t4_cpl_io.c is not clear to me, I am see independed
>>> INP_WLOCK. Is this OK?
>>>
>>> Can be thread A wants do_peer_close() directed from chelsio IRQ
>>> handler, bypass tcp_input()?
>>
>> If you look carefully INP_WLOCK is used in cxgb_cpl_io.c and
>> t4_cpl_io.c before calling tcp_twstart().
>
> Yes, and you remeber: sys/netinet/tcp_subr.c
>
> 1535 struct tcpcb *
> 1536 tcp_close(struct tcpcb *tp)
> 1537 {
> ...
> 1569 INP_WUNLOCK(inp);
> 1570 ACCEPT_LOCK();
> 1571 SOCK_LOCK(so);
> 1572 so->so_state &= ~SS_PROTOREF;
> 1573 sofree(so);
> 1574 return (NULL);
>
> sofree() call tcp_usr_detach() and in tcp_usr_detach() we have
> unexpected INP_TIMEWAIT.

I see, thus just for the context: The TCP stack in sys/dev/cxgb* is a
TOE (TCP Offload Engine?) TCP stack for Chelsio NICs, it is a
separate/side TCP stack that is used only with TCP_OFFLOAD option.

This TOE TCP stack actually has its own set of detach()/input()
functions and seems to check INP_DROPPED flag properly. I guess @np
check fixes in socket TCP stack and decides which one can also impact
the Chelsio TOE TCP stack. Some bugs are only in socket TCP stack, some
are only in TOE TCP stack.

--
Julien

signature.asc

Slawa Olhovchenkov

unread,
Oct 12, 2016, 5:53:02 AM10/12/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
I am about case when at time of first INP_WUNLOCK() tcp_timer_2msl()
don't see INP_TIMEWAIT, call tcp_close(), tcp_close() do INP_WUNLOCK()
and now Chelsio TOE take INP_WLOCK, do tcp_twstart() and set
INP_TIMEWAIT. After this tcp_timer_2msl resume and have unexpected
INP_TIMEWAIT in tcp_usr_detach().

Julien Charbon

unread,
Oct 12, 2016, 8:08:00 AM10/12/16
to Slawa Olhovchenkov, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara

Hi Slawa,
Sure, basically the same bug that in classic TCP stack. If you think
it can happen, send an email describing that to np@ and he will check
and fix that. He is a TOE TCP stack expert and I am not. In all cases,
if this issue is possible in TOE TCP stack context, the patch will be
straightforward: If the INP_DROPPED flag is set do not call tcp_twstart().

The current patch focuses only on the classic TCP stack.

--
Julien

signature.asc

Slawa Olhovchenkov

unread,
Oct 12, 2016, 8:13:58 AM10/12/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
May be current workaround (with logging) in tcp_usr_detach() is good
solutuion for preventing system lockout by similar bugs?

Julien Charbon

unread,
Oct 12, 2016, 8:36:05 AM10/12/16
to Slawa Olhovchenkov, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara

Hi Slawa,
Good question, the quick workaround in tcp_usr_detach() does not handle
all the cases. If it reduces the number of crashes you can still find
scenarios where it can have unexpected side effect.

Long term solution is to enforce: If the inp has the INP_DROPPED flag
just stop processing it and return. If you grep the INP_DROPPED flag in
kernel sources, you can see that this test is already done in almost all
tcp_*() processing functions but tcp_input().

I would say that even without this issue tcp_input() should check
INP_DROPPED flags after INP_WLOCK anyway. Same for the TOE TCP stack,
you are simply not supposed to process a inp with INP_DROPPED flag.

--
Julien

signature.asc

Slawa Olhovchenkov

unread,
Oct 12, 2016, 9:01:35 AM10/12/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
This is best then guaranted lockout.

> Long term solution is to enforce: If the inp has the INP_DROPPED flag
> just stop processing it and return. If you grep the INP_DROPPED flag in
> kernel sources, you can see that this test is already done in almost all
> tcp_*() processing functions but tcp_input().
>
> I would say that even without this issue tcp_input() should check
> INP_DROPPED flags after INP_WLOCK anyway. Same for the TOE TCP stack,
> you are simply not supposed to process a inp with INP_DROPPED flag.

Absolutly acceptant!
May point is: more check and good handling of check result is best for
stability.

I.e. AND check INP_DROPPED in tcp_input AND workaroud INP_TIMEWAIT in
tcp_usr_detach (with logging) and check of some posible cases in XXX TOE.

Current TCP stack too complex and have many corner cases. This is need
additional guards where posible (not caused kernel panic).

Julien Charbon

unread,
Oct 12, 2016, 11:18:30 AM10/12/16
to Slawa Olhovchenkov, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara

Hi Slawa,
I see your point: Even if this issue is caught by this assert:

KASSERT(tp == NULL, ("tcp_detach: INP_TIMEWAIT && "
"INP_DROPPED && tp != NULL"));
https://github.com/freebsd/freebsd/blob/release/11.0.0/sys/netinet/tcp_usrreq.c#L213

you might not have INVARIANT option, then you will get a lockout quite
difficult to debug. Thus what we can do is:

- If INVARIANT is set: kernel panic to get all the details in the core.
- If INVARIANT is not set: Log this error with an explicit kernel
log(LOG_ERR) describing the issue, and then use the workaround to avoid
the double-free to let the system to good enough state.

Something like:

tcp_detach() {

...
if (inp->inp_flags & INP_TIMEWAIT) {

...
if (inp->inp_flags & INP_DROPPED) {

in_pcbdetach(inp);
if (__predict_true(tp == NULL)) {
in_pcbfree(inp);
} else {
#ifdef INVARIANTS
panic("tcp_detach: tp != NULL, That's not good because 'blah'\n");
#else
log(LOG_ERR, "tcp_detach: tp != NULL, That's not good because
'blah'\n");
#endif
INP_WUNLOCK(inp);
}
}
}

..

}

--
Julien

signature.asc

Slawa Olhovchenkov

unread,
Oct 12, 2016, 11:42:59 AM10/12/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
On Wed, Oct 12, 2016 at 05:17:35PM +0200, Julien Charbon wrote:

> >>>>>>>> I see, thus just for the context: The TCP stack in sys/dev/cxgb* is a
> >>>>>>>> TOE (TCP Offload Engine?) TCP stack for Chelsio NICs, it is a
> >>>>>>>> separate/side TCP stack that is used only with TCP_OFFLOAD option.
> >>>>>>>>
> >>>>>>>> This TOE TCP stack actually has its own set of detach()/input()
> >>>>>>>> functions and seems to check INP_DROPPED flag properly. I guess @np
> >>>>>>>> check fixes in socket TCP stack and decides which one can also impact
> >>>>>>>> the Chelsio TOE TCP stack. Some bugs are only in socket TCP stack, some
> >>>>>>>> are only in TOE TCP stack.
> >>>>>>>
> >>>>>>> I am fear about other direction -- setting INP_TIMEWAIT in Chelsio TOE
> >>>>>>> TCP stack and impact this to
> >>>>>>> tcp_timer_2msl()/tcp_close()/sofree()/tcp_usr_detach() path.
> >>>>>>
> >>>>>> I see, I expect no problem on this side as tcp_timer_2msl() checks the
> >>>>>> INP_TIMEWAIT flag and do not call tcp_close() if set.
> >>>>>
> >>>>> I am about case when at time of first INP_WUNLOCK() tcp_timer_2msl()
> >>>>> don't see INP_TIMEWAIT, call tcp_close(), tcp_close() do INP_WUNLOCK()
> >>>>> and now Chelsio TOE take INP_WLOCK, do tcp_twstart() and set
> >>>>> INP_TIMEWAIT. After this tcp_timer_2msl resume and have unexpected
> >>>>> INP_TIMEWAIT in tcp_usr_detach().
> >>>>
> >>>> Sure, basically the same bug that in classic TCP stack. If you think
> >>>> it can happen, send an email describing that to np@ and he will check
> >>>> and fix that. He is a TOE TCP stack expert and I am not. In all cases,
> >>>> if this issue is possible in TOE TCP stack context, the patch will be
> >>>> straightforward: If the INP_DROPPED flag is set do not call tcp_twstart().

I am email to np@
Yes, thanks!

> tcp_detach() {
>
> ...
> if (inp->inp_flags & INP_TIMEWAIT) {
>
> ...
> if (inp->inp_flags & INP_DROPPED) {
>
> in_pcbdetach(inp);
> if (__predict_true(tp == NULL)) {
> in_pcbfree(inp);
> } else {
> #ifdef INVARIANTS
> panic("tcp_detach: tp != NULL, That's not good because 'blah'\n");
> #else
> log(LOG_ERR, "tcp_detach: tp != NULL, That's not good because
> 'blah'\n");

May be some more info in log can help to detect root cause of issuse?
I am don't know what info, may be flags or number of references?

> #endif
> INP_WUNLOCK(inp);
> }
> }
> }
>
> ...
>
> }
>
> --
> Julien

Navdeep Parhar

unread,
Oct 12, 2016, 1:32:53 PM10/12/16
to Julien Charbon, Slawa Olhovchenkov, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara

> I see, thus just for the context: The TCP stack in sys/dev/cxgb* is a
> TOE (TCP Offload Engine?) TCP stack for Chelsio NICs, it is a
> separate/side TCP stack that is used only with TCP_OFFLOAD option.
>
> This TOE TCP stack actually has its own set of detach()/input()
> functions and seems to check INP_DROPPED flag properly. I guess @np
> check fixes in socket TCP stack and decides which one can also impact

Yes, I do keep an eye on the changes in the stack and keep the TOE
drivers up to date. The good part is that those drivers are trying to
do the exact same thing with the locks and various bits of state as the
software stack so they don't have any special requirements.

If any patch comes out of this discussion I'll update the TOE drivers
(if needed).

btw, if you're looking at TOE drivers then read the sys/dev/cxgbe/t4_tom
code (instead of the old T3 chip's cxgb/ulp/tom) as that's the most
actively maintained and tested.

Regards,
Navdeep

Julien Charbon

unread,
Oct 13, 2016, 8:02:41 AM10/13/16
to Slawa Olhovchenkov, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara

Hi Slawa,
Proposed changes added in the review:

https://reviews.freebsd.org/D8211

tell me when you have three days without issue with this change.

>> tcp_detach() {
>>
>> ...
>> if (inp->inp_flags & INP_TIMEWAIT) {
>>
>> ...
>> if (inp->inp_flags & INP_DROPPED) {
>>
>> in_pcbdetach(inp);
>> if (__predict_true(tp == NULL)) {
>> in_pcbfree(inp);
>> } else {
>> #ifdef INVARIANTS
>> panic("tcp_detach: tp != NULL, That's not good because 'blah'\n");
>> #else
>> log(LOG_ERR, "tcp_detach: tp != NULL, That's not good because
>> 'blah'\n");
>
> May be some more info in log can help to detect root cause of issuse?
> I am don't know what info, may be flags or number of references?

For this kind of issue, the useful part is the stacktrace. INVARIANT
will give you that trace in the core, and without INVARIANT then it is
better to use dtrace:

$ cat tcp-twstart-dropped.d
fbt::tcp_twstart:entry
/args[0]->t_inpcb->inp_flags & 0x04000000/
{
stack();
printf("INP_DROPPED in tcp_twstart: %x", args[0]->t_inpcb->inp_flags);
}

--
Julien


signature.asc

Slawa Olhovchenkov

unread,
Oct 13, 2016, 10:39:02 AM10/13/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
On Thu, Oct 13, 2016 at 01:56:21PM +0200, Julien Charbon wrote:

> >> Something like:
> >
> > Yes, thanks!
>
> Proposed changes added in the review:
>
> https://reviews.freebsd.org/D8211
>
> tell me when you have three days without issue with this change.
>
> >> tcp_detach() {
> >>
> >> ...
> >> if (inp->inp_flags & INP_TIMEWAIT) {
> >>
> >> ...
> >> if (inp->inp_flags & INP_DROPPED) {
> >>
> >> in_pcbdetach(inp);
> >> if (__predict_true(tp == NULL)) {
> >> in_pcbfree(inp);
> >> } else {
> >> #ifdef INVARIANTS
> >> panic("tcp_detach: tp != NULL, That's not good because 'blah'\n");
> >> #else
> >> log(LOG_ERR, "tcp_detach: tp != NULL, That's not good because
> >> 'blah'\n");
> >
> > May be some more info in log can help to detect root cause of issuse?
> > I am don't know what info, may be flags or number of references?
>
> For this kind of issue, the useful part is the stacktrace. INVARIANT

Like this?

#ifdef KDB
kdb_backtrace();
#endif

as found in sys/netgraph/ng_base.c

> will give you that trace in the core, and without INVARIANT then it is
> better to use dtrace:
>
> $ cat tcp-twstart-dropped.d
> fbt::tcp_twstart:entry
> /args[0]->t_inpcb->inp_flags & 0x04000000/
> {
> stack();
> printf("INP_DROPPED in tcp_twstart: %x", args[0]->t_inpcb->inp_flags);
> }

Same code may be insert there too, IMHO.

Julien Charbon

unread,
Oct 13, 2016, 11:06:56 AM10/13/16
to Slawa Olhovchenkov, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara

Hi Slawa,
It is overkill dtrace can do that.

>> will give you that trace in the core, and without INVARIANT then it is
>> better to use dtrace:
>>
>> $ cat tcp-twstart-dropped.d
>> fbt::tcp_twstart:entry
>> /args[0]->t_inpcb->inp_flags & 0x04000000/
>> {
>> stack();
>> printf("INP_DROPPED in tcp_twstart: %x", args[0]->t_inpcb->inp_flags);
>> }
>
> Same code may be insert there too, IMHO.

Hmm, I don't think so:

- If you have INVARIANT, the kernel will panic in tcp_twstart() or
tcp_detach() and you will have everything you need to debug.
- If you don't, dtrace is the right tool to use in all cases anyway.

--
Julien

signature.asc

Slawa Olhovchenkov

unread,
Oct 13, 2016, 11:17:42 AM10/13/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
On Thu, Oct 13, 2016 at 05:06:00PM +0200, Julien Charbon wrote:

> >> will give you that trace in the core, and without INVARIANT then it is
> >> better to use dtrace:
> >>
> >> $ cat tcp-twstart-dropped.d
> >> fbt::tcp_twstart:entry
> >> /args[0]->t_inpcb->inp_flags & 0x04000000/
> >> {
> >> stack();
> >> printf("INP_DROPPED in tcp_twstart: %x", args[0]->t_inpcb->inp_flags);
> >> }
> >
> > Same code may be insert there too, IMHO.
>
> Hmm, I don't think so:
>
> - If you have INVARIANT, the kernel will panic in tcp_twstart() or
> tcp_detach() and you will have everything you need to debug.
> - If you don't, dtrace is the right tool to use in all cases anyway.

dtrace don't executed in may case w/ diagnostic "dtrace: processing
aborted: Abort due to systemic unresponsiveness". This is for
tcp_close. May be tcp_twstart will be more successuful, may be not.
Also, using dtrace too complex in production (need complex startup
under screen and capture output) and for many peoples.
kdb_backtrace() have too less administrative overhead.

Julien Charbon

unread,
Oct 13, 2016, 12:22:23 PM10/13/16
to Slawa Olhovchenkov, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
On 10/13/16 5:17 PM, Slawa Olhovchenkov wrote:
> On Thu, Oct 13, 2016 at 05:06:00PM +0200, Julien Charbon wrote:
>
>>>> will give you that trace in the core, and without INVARIANT then it is
>>>> better to use dtrace:
>>>>
>>>> $ cat tcp-twstart-dropped.d
>>>> fbt::tcp_twstart:entry
>>>> /args[0]->t_inpcb->inp_flags & 0x04000000/
>>>> {
>>>> stack();
>>>> printf("INP_DROPPED in tcp_twstart: %x", args[0]->t_inpcb->inp_flags);
>>>> }
>>>
>>> Same code may be insert there too, IMHO.
>>
>> Hmm, I don't think so:
>>
>> - If you have INVARIANT, the kernel will panic in tcp_twstart() or
>> tcp_detach() and you will have everything you need to debug.
>> - If you don't, dtrace is the right tool to use in all cases anyway.
>
> dtrace don't executed in may case w/ diagnostic "dtrace: processing
> aborted: Abort due to systemic unresponsiveness". This is for
> tcp_close. May be tcp_twstart will be more successuful, may be not.

It does and will.

> Also, using dtrace too complex in production (need complex startup
> under screen and capture output) and for many peoples.
> kdb_backtrace() have too less administrative overhead.

I still think it is overkill. The main goal of this change is to fix a
quite tricky and old TCP stack locking issue. Let's try to do that
first, it is complex enough by itself.

Once the fix is validated and pushed, feel free to propose your own
patch/review to add kdb_backtrace(), log(), etc.. to get other devs
point of view.

I don't remember who said: "Never ever optimize error cases"...

--
Julien

signature.asc

Slawa Olhovchenkov

unread,
Oct 14, 2016, 5:36:15 AM10/14/16
to Julien Charbon, Konstantin Belousov, freebsd...@freebsd.org, hiren panchasara
This is not optimeze error cases, this is error recovery and
diagnostic of error cases in other subsystems.

Currently FreeBSD internals too complex for just always trust on
correct of other subsystem or do panic on any incosystency.

INVARIANTS too expensive now (20Gbit drops to 8Gbits).

PS: I am applay patch. Wait till monday.

Thanks very match for this hard work!
It is loading more messages.
0 new messages