[9fans] etherigbe.c using _xinc?

Venkatesh Srinivas

Dec 8, 2009, 11:29:28 AM

Hi,

I noticed etherigbe.c (in igberballoc) was recently changed to
increment the refcount on the block it allocates. Any reason it uses
_xinc rather than incref?

-- vs

erik quanstrom

Dec 8, 2009, 11:43:02 AM

because it's not a Ref. unfortunately, if it were
a Ref, it would be much faster. _xinc is deadly
slow even if there is no contention on x86.

i wish the ref counting had at least been isolated to the
case that needs them. blocks in queues typically
have one owner. so the owner of the block assumes
it can modify the block whenever with no locking.
ref counting means this assumption is false.
i'm not sure how you're supposed to wlock a block.

- erik

Russ Cox

Dec 8, 2009, 2:39:27 PM

> because it's not a Ref. unfortunately, if it were
> a Ref, it would be much faster. _xinc is deadly
> slow even if there is no contention on x86.

do you have numbers to back up this claim?

you are claiming that the locked XCHGL
in tas (pc/l.s) called from lock (port/taslock.c)
called from incref (port/chan.c) is "much faster"
than the locked INCL in _xinc (pc/l.s).
it seems to me that a locked memory bus
is a locked memory bus.

also, when up != nil (a common condition),
lock does a locked INCL and DECL
(_xinc and _xdec) in addition to the tas,
which seems like strictly more work than
a single _xinc.

russ
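
The argument above is about instruction counts: incref takes a lock (a test-and-set), bumps a plain counter and releases the lock, while _xinc is a single locked increment. A minimal sketch of the two shapes in portable C11 follows. It is not the Plan 9 kernel code: SketchLock, SketchRef, sketchincref and sketchxinc are hypothetical names, C11 atomics stand in for the assembly in pc/l.s, and the extra _xinc/_xdec that lock performs when up != nil is left out.

```c
#include <stdatomic.h>

/* Hypothetical stand-ins for the kernel's Lock and Ref; not the Plan 9 definitions. */
typedef struct {
	atomic_flag held;	/* set while the lock is held */
} SketchLock;

typedef struct {
	SketchLock l;
	long ref;	/* plain counter, protected by l */
} SketchRef;

/* Spin until test-and-set succeeds: one atomic read-modify-write per attempt. */
static void
sketchlock(SketchLock *l)
{
	while(atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
		;
}

/* Release the lock with a store; no read-modify-write is needed here. */
static void
sketchunlock(SketchLock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* incref shape: lock, plain increment, unlock. Returns the new count. */
static long
sketchincref(SketchRef *r)
{
	long x;

	sketchlock(&r->l);
	x = ++r->ref;
	sketchunlock(&r->l);
	return x;
}

/* _xinc shape: one atomic increment, no separate lock word. Returns the new count. */
static long
sketchxinc(atomic_long *p)
{
	return atomic_fetch_add_explicit(p, 1, memory_order_seq_cst) + 1;
}
```

In this sketch the uncontended incref path performs one atomic read-modify-write plus ordinary loads and stores, and the xinc path performs one atomic read-modify-write alone, which is the comparison the message above draws.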

John Floren

Dec 8, 2009, 2:55:09 PM

On Tue, Dec 8, 2009 at 2:35 PM, Russ Cox <r...@swtch.com> wrote:
>> because it's not a Ref. unfortunately, if it were
>> a Ref, it would be much faster. _xinc is deadly
>> slow even if there is no contention on x86.
>
> do you have numbers to back up this claim?
>

I don't have the code or the numbers in front of me, but I recall
seeing quite a bit of speed improvement when I experimentally replaced
incref/decref with direct calls to _xinc/_xdec. I don't remember what
the test was, but I do remember that I got something like 35%
improvement on it. I ran that kernel on my terminal for the rest of
the summer without trouble; while I didn't notice a blazing speed
increase, it didn't slow me down either.

John
--
"Object-oriented design is the roman numerals of computing" -- Rob Pike

erik quanstrom

Dec 8, 2009, 3:03:51 PM

> do you have numbers to back up this claim?
>
> you are claiming that the locked XCHGL
> in tas (pc/l.s) called from lock (port/taslock.c)
> called from incref (port/chan.c) is "much faster"
> than the locked INCL in _xinc (pc/l.s).
> it seems to me that a locked memory bus
> is a locked memory bus.

yes, i do. xinc on most modern intel is a real
loss. and a moderate loss on amd. my atom 330
is an exception.

intel core i7 2.4ghz
loop       0 nsec/call
loopxinc  20 nsec/call
looplock  11 nsec/call

intel 5000 1.6ghz
loop       0 nsec/call
loopxinc  44 nsec/call
looplock  25 nsec/call

intel atom 330 1.6ghz (exception!)
loop       2 nsec/call
loopxinc  14 nsec/call
looplock  22 nsec/call

amd k10 2.0ghz
loop       2 nsec/call
loopxinc  30 nsec/call
looplock  20 nsec/call

intel p4 xeon 3.0ghz
loop       1 nsec/call
loopxinc  76 nsec/call
looplock  42 nsec/call

- erik

xinc.s

timing.c

Russ Cox

Dec 8, 2009, 6:55:11 PM

it looks like you are comparing these two functions

void
loopxinc(void)
{
	uint i, x;

	for(i = 0; i < N; i++){
		_xinc(&x);
		_xdec(&x);
	}
}

void
looplock(void)
{
	uint i;
	static Lock l;

	for(i = 0; i < N; i++){
		lock(&l);
		unlock(&l);
	}
}

but the former does two operations and the latter
only one. your claim was that _xinc is slower
than incref (== lock(), x++, unlock()). but you are
timing xinc+xdec against incref.

assuming xinc and xdec are approximately the same
cost (so i can just halve the numbers for loopxinc),
that would make the fair comparison produce:

intel core i7 2.4ghz
loop       0 nsec/call
loopxinc  10 nsec/call // was 20
looplock  11 nsec/call

intel 5000 1.6ghz
loop       0 nsec/call
loopxinc  22 nsec/call // was 44
looplock  25 nsec/call

intel atom 330 1.6ghz (exception!)
loop       2 nsec/call
loopxinc   7 nsec/call // was 14
looplock  22 nsec/call

amd k10 2.0ghz
loop       2 nsec/call
loopxinc  15 nsec/call // was 30
looplock  20 nsec/call

intel p4 xeon 3.0ghz
loop       1 nsec/call
loopxinc  38 nsec/call // was 76
looplock  42 nsec/call

which looks like a much different story.

russ
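
To time one increment against one incref, rather than an xinc+xdec pair against a lock/unlock pair, each loop can do exactly one logical operation per iteration. The harness below is a user-space sketch under that assumption, not the timing.c attached earlier in the thread: it uses C11 atomics in place of _xinc and tas, clock() CPU time in place of whatever timer the original used, and hypothetical names throughout (loopinc, loopincref, nsecpercall, and the iteration count N). Numbers it produces are not comparable to the kernel figures above.

```c
#include <stdatomic.h>
#include <time.h>

enum { N = 1000000 };	/* iterations per measurement; an arbitrary choice */

static atomic_long counter;	/* target of the atomic increments */
static atomic_flag lk = ATOMIC_FLAG_INIT;	/* stand-in for a kernel Lock */
static long plain;	/* counter protected by lk */

/* One atomic increment per iteration (xinc only, no matching xdec). */
static void
loopinc(void)
{
	long i;

	for(i = 0; i < N; i++)
		atomic_fetch_add(&counter, 1);
}

/* One lock, increment, unlock per iteration (the shape of incref). */
static void
loopincref(void)
{
	long i;

	for(i = 0; i < N; i++){
		while(atomic_flag_test_and_set_explicit(&lk, memory_order_acquire))
			;
		plain++;
		atomic_flag_clear_explicit(&lk, memory_order_release);
	}
}

/* Average cost of one iteration of f, in nanoseconds of CPU time. */
static double
nsecpercall(void (*f)(void))
{
	clock_t t0, t1;

	t0 = clock();
	f();
	t1 = clock();
	return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / N;
}
```

Calling nsecpercall(loopinc) and nsecpercall(loopincref) on the same machine gives the two per-operation figures to compare, without having to halve one of them afterwards.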

erik quanstrom

Dec 8, 2009, 7:44:12 PM

> but the former does two operations and the latter
> only one. your claim was that _xinc is slower
> than incref (== lock(), x++, unlock()). but you are
> timing xinc+xdec against incref.

sure. i was looking at it as a kernel version of a
semaphore.

back to the original problem: before, allocb/freeb
did 2 lock/unlocks. now it does 2 lock/unlocks
+ 2 xinc/xdec, and is in the best case 31% slower
and in the worst case 90% slower. the reference
counting is a heavy price to pay on every network
block, when it is only used by ip/gre.c.

- erik
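
The thread does not say how the 31% and 90% figures were derived. One reading that lands on them, offered as an assumption rather than as erik's stated method: take the old allocb/freeb cost as 2 lock/unlock pairs (2 x looplock), the new cost as that plus one _xinc/_xdec pair (1 x loopxinc), and compute the relative increase from the unadjusted nsec/call table. Timing, timings and slowdownpct are hypothetical names.

```c
#include <assert.h>

/* nsec/call figures copied from the timing table earlier in the thread. */
typedef struct {
	const char *cpu;
	int xincpair;	/* loopxinc: one _xinc plus one _xdec */
	int lockpair;	/* looplock: one lock plus one unlock */
} Timing;

static const Timing timings[] = {
	{"intel core i7 2.4ghz", 20, 11},
	{"intel 5000 1.6ghz", 44, 25},
	{"intel atom 330 1.6ghz", 14, 22},
	{"amd k10 2.0ghz", 30, 20},
	{"intel p4 xeon 3.0ghz", 76, 42},
};

/*
 * Percent slowdown, rounded down, ASSUMING the old cost is
 * 2 lock/unlock pairs and the new cost adds one xinc/xdec pair.
 * (before + xincpair)/before - 1 simplifies to xincpair/before.
 */
static int
slowdownpct(const Timing *t)
{
	int before = 2 * t->lockpair;

	assert(before > 0);
	return 100 * t->xincpair / before;
}
```

Under that reading the atom 330 gives 31% (14/44), the core i7 and p4 xeon give 90% (20/22 and 76/84), and the intel 5000 and k10 fall in between at 88% and 75%. These percentages compare the cost of the locking operations alone, not of allocb/freeb as a whole.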

Russ Cox

Dec 8, 2009, 8:09:30 PM

On Tue, Dec 8, 2009 at 4:32 PM, erik quanstrom <quan...@quanstro.net> wrote:
>> but the former does two operations and the latter
>> only one. your claim was that _xinc is slower
>> than incref (== lock(), x++, unlock()). but you are
>> timing xinc+xdec against incref.
>
> sure. i was looking at it as a kernel version of a
> semaphore.

no, your original claim was that incref/decref
was faster than _xinc/_xdec. the numbers
don't support that claim.

> the reference
> counting is a heavy price to pay on every network
> block, when it is only used by ip/gre.c.

has the network gotten fast enough that an extra
bus transaction per block slows it down?
it seems like gigabit ethernet would be around
100k packets per second, so the extra 50ns
or so per packet would be 5ms per second in
practice, which is significant but hardly
seems prohibitive.

> before, allocb/freeb
> did 2 lock/unlocks. now it does 2 lock/unlocks
> + 2 xinc/xdec, and is in the best case 31% slower
> and in the worst case 90% slower.

i don't know how you get those numbers but
anything even approaching that would mean that
the kernel is spending all its time in igberballoc,
at which point you probably have other things
to fix.

russ

erik quanstrom

Dec 8, 2009, 9:13:00 PM

> has the network gotten fast enough that an extra
> bus transaction per block slows it down?
> it seems like gigabit ethernet would be around
> 100k packets per second, so the extra 50ns
> or so per packet would be 5ms per second in
> practice, which is significant but hardly
> seems prohibitive.

i'm working with 10gbe. pcie 2.0 is making 2x10gbe
attractive. multiply by 10 or 20. and if you're doing a
request/response, multiply by 2 again.

- erik
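
The scaling in this exchange is straightforward arithmetic and can be written down directly. The sketch below takes the figures quoted above (roughly 100k packets per second on gigabit ethernet, roughly 50ns of extra cost per packet) and applies the suggested multipliers. overheadusec is a hypothetical helper, and reading "multiply by 2 again" as a plain doubling of the packet count is an assumption.

```c
#include <assert.h>

/*
 * Time spent on per-packet overhead, in microseconds per second of traffic.
 * pps: packets per second; nsperpkt: extra nanoseconds per packet.
 */
static long
overheadusec(long pps, long nsperpkt)
{
	assert(pps >= 0 && nsperpkt >= 0);
	return pps * nsperpkt / 1000;
}
```

With these inputs, overheadusec(100000, 50) is 5000, the 5ms per second estimated for gigabit. Ten times the packet rate gives 50ms per second, twenty times (2x10gbe) gives 100ms, and doubling again for request/response gives 200ms, which would be about a fifth of one processor's time if the 50ns estimate holds at that rate.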
