hackbench regression due to commit 9dfc6e68bfe6e

Alex Shi

unread,

Mar 25, 2010, 4:40:01 AM3/25/10

to

The hackbench benchmark dropped about 3~7% on our 2 sockets NHM machine
on 34-rc1 kernel. We find it is due to commit 9dfc6e68bfe6e,

commit 9dfc6e68bfe6ee452efb1a4e9ca26a9007f2b864
Author: Christoph Lameter <c...@linux-foundation.org>
Date: Fri Dec 18 16:26:20 2009 -0600

SLUB: Use this_cpu operations in slub

The hackbench is prepared hundreds pair of processes/threads. And each
of pair of processes consists of a receiver and a sender. After all
pairs created and ready with a few memory block (by malloc), hackbench
let the sender do appointed times sending to receiver via socket, then
wait all pairs finished. The total sending running time is the indicator
of this benchmark. The less the better.
The socket send/receiver generate lots of slub alloc/free. slabinfo
command show the following slub get huge increase from about 81412344 to
141412497, after command "backbench 150 thread 1000" running.

Name Objects Alloc Free %Fast Fallb O
:t-0001024 870 141412497 141412132 94 1 0 3
:t-0000256 1607 141225312 141224177 94 1 0 1

Via perf tool I collected the L1 data cache miss info of comamnd:
"./hackbench 150 thread 100"

On 33-rc1, about 1303976612 time L1 Dcache missing

On 9dfc6, about 1360574760 times L1 Dcache missing

I also disassemble the mm/built.o file, but seems no special change.

BRG
Alex

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Christoph Lameter

unread,

Mar 25, 2010, 11:00:03 AM3/25/10

to

On Thu, 25 Mar 2010, Alex Shi wrote:

> SLUB: Use this_cpu operations in slub
>
> The hackbench is prepared hundreds pair of processes/threads. And each
> of pair of processes consists of a receiver and a sender. After all
> pairs created and ready with a few memory block (by malloc), hackbench
> let the sender do appointed times sending to receiver via socket, then
> wait all pairs finished. The total sending running time is the indicator
> of this benchmark. The less the better.

> The socket send/receiver generate lots of slub alloc/free. slabinfo
> command show the following slub get huge increase from about 81412344 to
> 141412497, after command "backbench 150 thread 1000" running.

The number of frees is different? From 81 mio to 141 mio? Are you sure it
was the same load?

> Name Objects Alloc Free %Fast Fallb O
> :t-0001024 870 141412497 141412132 94 1 0 3
> :t-0000256 1607 141225312 141224177 94 1 0 1
>
>
> Via perf tool I collected the L1 data cache miss info of comamnd:
> "./hackbench 150 thread 100"
>
> On 33-rc1, about 1303976612 time L1 Dcache missing
>
> On 9dfc6, about 1360574760 times L1 Dcache missing

I hope this is the same load?

What debugging options did you use? We are now using per cpu operations in
the hot paths. Enabling debugging for per cpu ops could decrease your
performance now. Have a look at a dissassembly of kfree() to verify that
there is no instrumentation.

Alex Shi

unread,

Mar 25, 2010, 10:40:01 PM3/25/10

to

On Thu, 2010-03-25 at 22:49 +0800, Christoph Lameter wrote:
> On Thu, 25 Mar 2010, Alex Shi wrote:
>
> > SLUB: Use this_cpu operations in slub
> >
> > The hackbench is prepared hundreds pair of processes/threads. And each
> > of pair of processes consists of a receiver and a sender. After all
> > pairs created and ready with a few memory block (by malloc), hackbench
> > let the sender do appointed times sending to receiver via socket, then
> > wait all pairs finished. The total sending running time is the indicator
> > of this benchmark. The less the better.
>
> > The socket send/receiver generate lots of slub alloc/free. slabinfo
> > command show the following slub get huge increase from about 81412344 to
> > 141412497, after command "backbench 150 thread 1000" running.
>
> The number of frees is different? From 81 mio to 141 mio? Are you sure it
> was the same load?

The slub free number has similar increase, the following is the data
before testing:
name Objects Alloc Free %Fast Fallb Onn
:t-0001024 855 81412344 81411981 93 1 0 3
:t-0000256 1540 81224970 81223835 93 1 0 1

I am sure there is no effective task running when I do testing.

Just for this info, CONFIG_SLUB_STATS enabled.

>
> > Name Objects Alloc Free %Fast Fallb O
> > :t-0001024 870 141412497 141412132 94 1 0 3
> > :t-0000256 1607 141225312 141224177 94 1 0 1
> >
> >
> > Via perf tool I collected the L1 data cache miss info of comamnd:
> > "./hackbench 150 thread 100"
> >
> > On 33-rc1, about 1303976612 time L1 Dcache missing
> >
> > On 9dfc6, about 1360574760 times L1 Dcache missing
>
> I hope this is the same load?

for the same load parameter: ./hackbench 150 thread 1000
on 33-rc1, about 10649258360 times L1 Dcache missing
on 9dfc6, about 11061002507 times L1 Dcahce missing

For this this info, without CONFIG_SLUB_STATS and slub_debug is close.

>
> What debugging options did you use? We are now using per cpu operations in
> the hot paths. Enabling debugging for per cpu ops could decrease your
> performance now. Have a look at a dissassembly of kfree() to verify that
> there is no instrumentation.
>

Basically, slub_debug never opened in booting, some SLUB related kernel
config is here:
CONFIG_SLUB_DEBUG=y
CONFIG_SLUB=y
#CONFIG_SLUB_DEBUG_ON is not set

I just dissemble kfree, but whether the KMEMTRACE enabled or not, the
trace_kfree code stay in kfree function, and in my testing the debugfs
are not mounted.

Zhang, Yanmin

unread,

Apr 1, 2010, 5:30:02 AM4/1/10

to

Christoph,

I suspect the moving of place of cpu_slab in kmem_cache causes the new cache
miss. But when I move it to the tail of the structure, kernel always panic when
booting. Perhaps there is another potential bug?

---
Mount-cache hash table entries: 256
general protection fault: 0000 [#1] SMP
last sysfs file:
CPU 0
Pid: 0, comm: swapper Not tainted 2.6.33-rc1-this_cpu #1 X8DTN/X8DTN
RIP: 0010:[<ffffffff810c5041>] [<ffffffff810c5041>] kmem_cache_alloc+0x58/0xf7
RSP: 0000:ffffffff81a01df8 EFLAGS: 00010083
RAX: ffff8800bec02220 RBX: ffffffff81c19180 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 00000000000006ae RDI: ffffffff818031ee
RBP: ffff8800bec02000 R08: ffff1000e6e02220 R09: 0000000000000002
R10: ffff88000001b9f0 R11: ffff88000001baf8 R12: 00000000000080d0
R13: 0000000000000296 R14: 00000000000080d0 R15: ffffffff8126b0be
FS: 0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000001a55000 CR4: 00000000000006b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a5d020)
Stack:
0000000000000010 ffffffff81a01e20 ffff880100002038 ffffffff81c19180
<0> 00000000000080d0 ffffffff81c19198 0000000000400000 ffffffff81836aca
<0> 0000000000000000 ffffffff8126b0be 0000000000000296 00000000000000d0
Call Trace:
[<ffffffff8126b0be>] ? idr_pre_get+0x29/0x6d
[<ffffffff8126b116>] ? ida_pre_get+0x14/0xba
[<ffffffff810e19a1>] ? alloc_vfsmnt+0x3c/0x166
[<ffffffff810cdd0e>] ? vfs_kern_mount+0x32/0x15b
[<ffffffff81b22c41>] ? sysfs_init+0x55/0xae
[<ffffffff81b21ce1>] ? mnt_init+0x9b/0x179
[<ffffffff81b2194e>] ? vfs_caches_init+0x105/0x115
[<ffffffff81b07c03>] ? start_kernel+0x32e/0x370

Christoph Lameter

unread,

Apr 1, 2010, 12:00:03 PM4/1/10

to

On Thu, 1 Apr 2010, Zhang, Yanmin wrote:

> I suspect the moving of place of cpu_slab in kmem_cache causes the new cache
> miss. But when I move it to the tail of the structure, kernel always panic when
> booting. Perhaps there is another potential bug?

Why would that cause an additional cache miss?

The node array is following at the end of the structure. If you want to
move it down then it needs to be placed before the node field.

Zhang, Yanmin

unread,

Apr 2, 2010, 4:10:02 AM4/2/10

to

On Thu, 2010-04-01 at 10:53 -0500, Christoph Lameter wrote:
> On Thu, 1 Apr 2010, Zhang, Yanmin wrote:
>
> > I suspect the moving of place of cpu_slab in kmem_cache causes the new cache
> > miss. But when I move it to the tail of the structure, kernel always panic when
> > booting. Perhaps there is another potential bug?
>
> Why would that cause an additional cache miss?
>
>
> The node array is following at the end of the structure. If you want to
> move it down then it needs to be placed before the node field

Thanks. The moving cpu_slab to tail doesn't improve it.

I used perf to collect statistics. Only data cache miss has a little difference.
My testing command on my 2 socket machine:
#hackbench 100 process 20000

With 2.6.33, it takes for about 96 seconds while 2.6.34-rc2 (or the latest tip tree)
takes for about 101 seconds.

perf shows some functions around SLUB have more cpu utilization, while some other
SLUB functions have less cpu utilization.

Christoph Lameter

unread,

Apr 5, 2010, 10:00:01 AM4/5/10

to

On Fri, 2 Apr 2010, Zhang, Yanmin wrote:

> My testing command on my 2 socket machine:
> #hackbench 100 process 20000
>
> With 2.6.33, it takes for about 96 seconds while 2.6.34-rc2 (or the latest tip tree)
> takes for about 101 seconds.
>
> perf shows some functions around SLUB have more cpu utilization, while some other
> SLUB functions have less cpu utilization.

Hmnmmm... The dynamic percpu areas use page tables and that data is used
in the fast path. Maybe the high thread count causes tlb trashing?

Pekka Enberg

unread,

Apr 5, 2010, 1:40:02 PM4/5/10

to

(I'm CC'ing Tejun)

On Mon, Apr 5, 2010 at 4:54 PM, Christoph Lameter
<c...@linux-foundation.org> wrote:
> On Fri, 2 Apr 2010, Zhang, Yanmin wrote:
>
>> My testing command on my 2 socket machine:
>> #hackbench 100 process 20000
>>
>> With 2.6.33, it takes for about 96 seconds while 2.6.34-rc2 (or the latest tip tree)
>> takes for about 101 seconds.
>>
>> perf shows some functions around SLUB have more cpu utilization, while some other
>> SLUB functions have less cpu utilization.
>
> Hmnmmm... The dynamic percpu areas use page tables and that data is used
> in the fast path. Maybe the high thread count causes tlb trashing?

Hmm indeed. I don't see anything particularly funny in the SLUB percpu
conversion so maybe this is a more issue with the new percpu
allocator?

Tejun Heo

unread,

Apr 5, 2010, 9:30:01 PM4/5/10

to

Hello,

On 04/06/2010 02:30 AM, Pekka Enberg wrote:
>> Hmnmmm... The dynamic percpu areas use page tables and that data is used
>> in the fast path. Maybe the high thread count causes tlb trashing?
>
> Hmm indeed. I don't see anything particularly funny in the SLUB percpu
> conversion so maybe this is a more issue with the new percpu
> allocator?

By default, percpu allocator embeds the first chunk in the kernel
linear mapping and accesses there shouldn't involve any TLB overhead.
From the second chunk on, they're mapped page-by-page into vmalloc
area. This can be updated to use larger page mapping but 2M page
per-cpu is pretty large and the trade off hasn't been right yet.

The amount reserved for dynamic allocation in the first chunk is
determined by PERCPU_DYNAMIC_RESERVE constant in
include/linux/percpu.h. It's currently 20k on 64bit machines and 12k
on 32bit. The intention was to size this such that most common stuff
is allocated from this area. The 20k and 12k are numbers that I
pulled out of my ass :-) with the custom config I used. Now that more
stuff has been converted to dynamic percpu, it's quite possible that
the area is too small. Can you please try to increase the size of the
area (say 2 or 4 times) and see whether the performance regression
goes away?

Thanks.

--
tejun

Zhang, Yanmin

unread,

Apr 6, 2010, 4:30:03 AM4/6/10

to

On Tue, 2010-04-06 at 10:27 +0900, Tejun Heo wrote:
> Hello,
>
> On 04/06/2010 02:30 AM, Pekka Enberg wrote:
> >> Hmnmmm... The dynamic percpu areas use page tables and that data is used
> >> in the fast path. Maybe the high thread count causes tlb trashing?
> >
> > Hmm indeed. I don't see anything particularly funny in the SLUB percpu
> > conversion so maybe this is a more issue with the new percpu
> > allocator?
>
> By default, percpu allocator embeds the first chunk in the kernel
> linear mapping and accesses there shouldn't involve any TLB overhead.
> >From the second chunk on, they're mapped page-by-page into vmalloc
> area. This can be updated to use larger page mapping but 2M page
> per-cpu is pretty large and the trade off hasn't been right yet.
>
> The amount reserved for dynamic allocation in the first chunk is
> determined by PERCPU_DYNAMIC_RESERVE constant in
> include/linux/percpu.h. It's currently 20k on 64bit machines and 12k
> on 32bit. The intention was to size this such that most common stuff
> is allocated from this area. The 20k and 12k are numbers that I
> pulled out of my ass :-) with the custom config I used. Now that more
> stuff has been converted to dynamic percpu, it's quite possible that
> the area is too small. Can you please try to increase the size of the
> area (say 2 or 4 times) and see whether the performance regression
> goes away?

Thanks. I tried 2 and 4 times and didn't see much improvement.
I checked /proc/vamallocinfo and it doesn't have item of pcpu_get_vm_areas
when I use 4 times of PERCPU_DYNAMIC_RESERVE.

I used perf to collect dtlb misses and LLC misses. dtlb miss data is not
stable. Sometimes, we have a bigger dtlb miss, but get a better result.

LLC misses data are more stable. Only LLC-load-misses is the clear sign now.
LLC-store-misses has no big difference.

Christoph Lameter

unread,

Apr 6, 2010, 11:50:02 AM4/6/10

to

On Tue, 6 Apr 2010, Zhang, Yanmin wrote:

> Thanks. I tried 2 and 4 times and didn't see much improvement.
> I checked /proc/vamallocinfo and it doesn't have item of pcpu_get_vm_areas
> when I use 4 times of PERCPU_DYNAMIC_RESERVE.

> I used perf to collect dtlb misses and LLC misses. dtlb miss data is not
> stable. Sometimes, we have a bigger dtlb miss, but get a better result.
>
> LLC misses data are more stable. Only LLC-load-misses is the clear sign now.
> LLC-store-misses has no big difference.

LLC-load-miss is exactly what condition?

The cacheline environment in the hotpath should only include the following
cache lines (without debugging and counters):

1. The first cacheline from the kmem_cache structure

(This is different from the sitation before the 2.6.34 changes. Earlier
some critical values (object length etc) where available
from the kmem_cache_cpu structure. The cacheline containing the percpu
structure array was needed to determome the kmem_cache_cpu address!)

2. The first cacheline from kmem_cache_cpu

3. The first cacheline of the data object (free pointer)

And in case of a kfree/ kmem_cache_free:

4. Cacheline that contains the page struct of the page the object resides
in.

Can you post the .config you are using and the bootup messages?

Christoph Lameter

unread,

Apr 6, 2010, 5:00:04 PM4/6/10

to

We cannot reproduce the issue here. Our tests here (dual quad dell) show a
performance increase in hackbench instead.

Linux 2.6.33.2 #2 SMP Mon Apr 5 11:30:56 CDT 2010 x86_64 GNU/Linux
./hackbench 100 process 200000
Running with 100*40 (== 4000) tasks.
Time: 3102.142
./hackbench 100 process 20000
Running with 100*40 (== 4000) tasks.
Time: 308.731
./hackbench 100 process 20000
Running with 100*40 (== 4000) tasks.
Time: 311.591
./hackbench 100 process 20000
Running with 100*40 (== 4000) tasks.
Time: 310.200
./hackbench 10 process 20000
Running with 10*40 (== 400) tasks.
Time: 38.048
./hackbench 10 process 20000
Running with 10*40 (== 400) tasks.
Time: 44.711
./hackbench 10 process 20000
Running with 10*40 (== 400) tasks.
Time: 39.407
./hackbench 1 process 20000
Running with 1*40 (== 40) tasks.
Time: 9.411
./hackbench 1 process 20000
Running with 1*40 (== 40) tasks.
Time: 8.765
./hackbench 1 process 20000
Running with 1*40 (== 40) tasks.
Time: 8.822

Linux 2.6.34-rc3 #1 SMP Tue Apr 6 13:30:34 CDT 2010 x86_64 GNU/Linux
./hackbench 100 process 200000
Running with 100*40 (== 4000) tasks.
Time: 3003.578
./hackbench 100 process 20000
Running with 100*40 (== 4000) tasks.
Time: 300.289
./hackbench 100 process 20000
Running with 100*40 (== 4000) tasks.
Time: 301.462
./hackbench 100 process 20000
Running with 100*40 (== 4000) tasks.
Time: 301.173
./hackbench 10 process 20000
Running with 10*40 (== 400) tasks.
Time: 41.191
./hackbench 10 process 20000
Running with 10*40 (== 400) tasks.
Time: 41.964
./hackbench 10 process 20000
Running with 10*40 (== 400) tasks.
Time: 41.470
./hackbench 1 process 20000
Running with 1*40 (== 40) tasks.
Time: 8.829
./hackbench 1 process 20000
Running with 1*40 (== 40) tasks.
Time: 9.166
./hackbench 1 process 20000
Running with 1*40 (== 40) tasks.
Time: 8.681

Eric Dumazet

unread,

Apr 6, 2010, 6:20:01 PM4/6/10

to

Well, your config might be very different... and hackbench results can
vary by 10% on same machine, same kernel.

This is not a reliable bench, because af_unix is not prepared to get
such a lazy workload.

We really should warn people about this.

# hackbench 25 process 3000
Running with 25*40 (== 1000) tasks.
Time: 12.922
# hackbench 25 process 3000
Running with 25*40 (== 1000) tasks.
Time: 12.696
# hackbench 25 process 3000
Running with 25*40 (== 1000) tasks.
Time: 13.060
# hackbench 25 process 3000
Running with 25*40 (== 1000) tasks.
Time: 14.108
# hackbench 25 process 3000
Running with 25*40 (== 1000) tasks.
Time: 13.165
# hackbench 25 process 3000
Running with 25*40 (== 1000) tasks.
Time: 13.310
# hackbench 25 process 3000
Running with 25*40 (== 1000) tasks.
Time: 12.530

booting with slub_min_order=3 do change hackbench results for example ;)

All writers can compete on spinlock for a target UNIX socket, we spend _lot_ of time spinning.

If we _really_ want to speedup hackbench, we would have to change unix_state_lock()
to use a non spinning locking primitive (aka lock_sock()), and slowdown normal path.

# perf record -f hackbench 25 process 3000
Running with 25*40 (== 1000) tasks.
Time: 13.330
[ perf record: Woken up 289 times to write data ]
[ perf record: Captured and wrote 54.312 MB perf.data (~2372928 samples) ]
# perf report
# Samples: 2370135
#
# Overhead Command Shared Object Symbol
# ........ ......... ............................ ......
#
9.68% hackbench [kernel] [k] do_raw_spin_lock
6.50% hackbench [kernel] [k] schedule
4.38% hackbench [kernel] [k] __kmalloc_track_caller
3.95% hackbench [kernel] [k] copy_to_user
3.86% hackbench [kernel] [k] __alloc_skb
3.77% hackbench [kernel] [k] unix_stream_recvmsg
3.12% hackbench [kernel] [k] sock_alloc_send_pskb
2.75% hackbench [vdso] [.] 0x000000ffffe425
2.28% hackbench [kernel] [k] sysenter_past_esp
2.03% hackbench [kernel] [k] __mutex_lock_common
2.00% hackbench [kernel] [k] kfree
2.00% hackbench [kernel] [k] delay_tsc
1.75% hackbench [kernel] [k] update_curr
1.70% hackbench [kernel] [k] kmem_cache_alloc
1.69% hackbench [kernel] [k] do_raw_spin_unlock
1.60% hackbench [kernel] [k] unix_stream_sendmsg
1.54% hackbench [kernel] [k] sched_clock_local
1.46% hackbench [kernel] [k] __slab_free
1.37% hackbench [kernel] [k] do_raw_read_lock
1.34% hackbench [kernel] [k] __switch_to
1.24% hackbench [kernel] [k] select_task_rq_fair
1.23% hackbench [kernel] [k] sock_wfree
1.21% hackbench [kernel] [k] _raw_spin_unlock_irqrestore
1.19% hackbench [kernel] [k] __mutex_unlock_slowpath
1.05% hackbench [kernel] [k] trace_hardirqs_off
0.99% hackbench [kernel] [k] __might_sleep
0.93% hackbench [kernel] [k] do_raw_read_unlock
0.93% hackbench [kernel] [k] _raw_spin_lock
0.91% hackbench [kernel] [k] try_to_wake_up
0.81% hackbench [kernel] [k] sched_clock
0.80% hackbench [kernel] [k] trace_hardirqs_on

Zhang, Yanmin

unread,

Apr 6, 2010, 9:00:01 PM4/6/10

to

On Tue, 2010-04-06 at 10:41 -0500, Christoph Lameter wrote:
> On Tue, 6 Apr 2010, Zhang, Yanmin wrote:
>
> > Thanks. I tried 2 and 4 times and didn't see much improvement.
> > I checked /proc/vamallocinfo and it doesn't have item of pcpu_get_vm_areas
> > when I use 4 times of PERCPU_DYNAMIC_RESERVE.
>
> > I used perf to collect dtlb misses and LLC misses. dtlb miss data is not
> > stable. Sometimes, we have a bigger dtlb miss, but get a better result.
> >
> > LLC misses data are more stable. Only LLC-load-misses is the clear sign now.
> > LLC-store-misses has no big difference.
>
> LLC-load-miss is exactly what condition?

I don't know. I just said it's a clear sign. Otherwise, there is no clear sign.
The function statistics collected by perf with event llc-load-misses are very
scattered.

>
> The cacheline environment in the hotpath should only include the following
> cache lines (without debugging and counters):
>
> 1. The first cacheline from the kmem_cache structure
>
> (This is different from the sitation before the 2.6.34 changes. Earlier
> some critical values (object length etc) where available
> from the kmem_cache_cpu structure. The cacheline containing the percpu
> structure array was needed to determome the kmem_cache_cpu address!)
>
> 2. The first cacheline from kmem_cache_cpu
>
> 3. The first cacheline of the data object (free pointer)
>
> And in case of a kfree/ kmem_cache_free:
>
> 4. Cacheline that contains the page struct of the page the object resides
> in.

I agree with your analysis, but we still have no answer.

>
> Can you post the .config you are using and the bootup messages?
>

Pls. see the 2 attachment.

CONFIG_SLUB_STATS has no big impact on results.

Yanmin

2.6.34-rc3-slubtj.dmesg

2.6.34-rc3_config

Zhang, Yanmin

unread,

Apr 6, 2010, 10:20:01 PM4/6/10

to

On Tue, 2010-04-06 at 15:55 -0500, Christoph Lameter wrote:
> We cannot reproduce the issue here. Our tests here (dual quad dell) show a
> performance increase in hackbench instead.

I run hackbench on many machines. The regression exists on Nehalem machine
(dual sockets, 2*4*2 logical cpu) and a tigerton (4 socket, 4*4 logical cpu)
machines. I tried it on a dual quad core2 machine and it does like what you said.

The regression also exists on 2 new-generation Nehalem (dual socket 2*6*2 logical cpu)
machines.

It seems hyperthreading cpu has more chances to trigger it.

Zhang, Yanmin

unread,

Apr 6, 2010, 10:40:01 PM4/6/10

to

Thanks. I also found that. Normally, my script runs hackbench for 3 times and
gets an average value. To decrease the variation, I use
'./hackbench 100 process 200000' to get a more stable result.

>
> We really should warn people about this.
>
>
>
> # hackbench 25 process 3000
> Running with 25*40 (== 1000) tasks.
> Time: 12.922
> # hackbench 25 process 3000
> Running with 25*40 (== 1000) tasks.
> Time: 12.696
> # hackbench 25 process 3000
> Running with 25*40 (== 1000) tasks.
> Time: 13.060
> # hackbench 25 process 3000
> Running with 25*40 (== 1000) tasks.
> Time: 14.108
> # hackbench 25 process 3000
> Running with 25*40 (== 1000) tasks.
> Time: 13.165
> # hackbench 25 process 3000
> Running with 25*40 (== 1000) tasks.
> Time: 13.310
> # hackbench 25 process 3000
> Running with 25*40 (== 1000) tasks.
> Time: 12.530
>
>
> booting with slub_min_order=3 do change hackbench results for example ;)

By default, slub_min_order=3 on my Nehalem machines. I also tried different
larger slub_min_order and didn't find help.

I collected retired instruction, dtlb miss and LLC miss.
Below is data of LLC miss.

Kernel 2.6.33:
# Samples: 11639436896 LLC-load-misses

#
# Overhead Command Shared Object Symbol

# ........ ............... ...................................................... ......
#
20.94% hackbench [kernel.kallsyms] [k] copy_user_generic_string
14.56% hackbench [kernel.kallsyms] [k] unix_stream_recvmsg
12.88% hackbench [kernel.kallsyms] [k] kfree
7.37% hackbench [kernel.kallsyms] [k] kmem_cache_free
7.18% hackbench [kernel.kallsyms] [k] kmem_cache_alloc_node
6.78% hackbench [kernel.kallsyms] [k] kfree_skb
6.27% hackbench [kernel.kallsyms] [k] __kmalloc_node_track_caller
2.73% hackbench [kernel.kallsyms] [k] __slab_free
2.21% hackbench [kernel.kallsyms] [k] get_partial_node
2.01% hackbench [kernel.kallsyms] [k] _raw_spin_lock
1.59% hackbench [kernel.kallsyms] [k] schedule
1.27% hackbench hackbench [.] receiver
0.99% hackbench libpthread-2.9.so [.] __read
0.87% hackbench [kernel.kallsyms] [k] unix_stream_sendmsg

Kernel 2.6.34-rc3:
# Samples: 13079611308 LLC-load-misses

#
# Overhead Command Shared Object Symbol

# ........ ............... .................................................................... ......
#
18.55% hackbench [kernel.kallsyms] [k] copy_user_generic_str
ing
13.19% hackbench [kernel.kallsyms] [k] unix_stream_recvmsg
11.62% hackbench [kernel.kallsyms] [k] kfree
8.54% hackbench [kernel.kallsyms] [k] kmem_cache_free
7.88% hackbench [kernel.kallsyms] [k] __kmalloc_node_track_
caller
6.54% hackbench [kernel.kallsyms] [k] kmem_cache_alloc_node
5.94% hackbench [kernel.kallsyms] [k] kfree_skb
3.48% hackbench [kernel.kallsyms] [k] __slab_free
2.15% hackbench [kernel.kallsyms] [k] _raw_spin_lock
1.83% hackbench [kernel.kallsyms] [k] schedule
1.82% hackbench [kernel.kallsyms] [k] get_partial_node
1.59% hackbench hackbench [.] receiver
1.37% hackbench libpthread-2.9.so [.] __read

Eric Dumazet

unread,

Apr 7, 2010, 2:40:02 AM4/7/10

to

Please check values of /proc/sys/net/core/rmem_default
and /proc/sys/net/core/wmem_default on your machines.

Their values can also change hackbench results, because increasing
wmem_default allows af_unix senders to consume much more skbs and stress
slab allocators (__slab_free), way beyond slub_min_order can tune them.

When 2000 senders are running (and 2000 receivers), we might consume
something like 2000 * 100.000 bytes of kernel memory for skbs. TLB
trashing is expected, because all these skbs can span many 2MB pages.
Maybe some node imbalance happens too.

You could try to boot your machine with less ram per node and check :

# cat /proc/buddyinfo
Node 0, zone DMA 2 1 2 2 1 1 1 0 1 1 3
Node 0, zone DMA32 219 298 143 584 145 57 44 41 31 26 517
Node 1, zone DMA32 4 1 17 1 0 3 2 2 2 2 123
Node 1, zone Normal 126 169 83 8 7 5 59 59 49 28 459

One experiment on your Nehalem machine would be to change hackbench so
that each group (20 senders/ 20 receivers) run on a particular NUMA
node.

x86info -c ->

CPU #1
EFamily: 0 EModel: 1 Family: 6 Model: 26 Stepping: 5
CPU Model: Core i7 (Nehalem)
Processor name string: Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
Type: 0 (Original OEM) Brand: 0 (Unsupported)
Number of cores per physical package=8
Number of logical processors per socket=16
Number of logical processors per core=2
APIC ID: 0x10 Package: 0 Core: 1 SMT ID 0
Cache info
L1 Instruction cache: 32KB, 4-way associative. 64 byte line size.
L1 Data cache: 32KB, 8-way associative. 64 byte line size.
L2 (MLC): 256KB, 8-way associative. 64 byte line size.
TLB info
Data TLB: 4KB pages, 4-way associative, 64 entries
64 byte prefetching.
Found unknown cache descriptors: 55 5a b2 ca e4

Zhang, Yanmin

unread,

Apr 7, 2010, 5:10:02 AM4/7/10

to

It's a good pointer. rmem_default and wmem_default are about 116k on my machine.
I changed them to 52K and it seems there is no improvement.

>
>
>
> You could try to boot your machine with less ram per node and check :
>
> # cat /proc/buddyinfo
> Node 0, zone DMA 2 1 2 2 1 1 1 0 1 1 3
> Node 0, zone DMA32 219 298 143 584 145 57 44 41 31 26 517
> Node 1, zone DMA32 4 1 17 1 0 3 2 2 2 2 123
> Node 1, zone Normal 126 169 83 8 7 5 59 59 49 28 459
>
>
> One experiment on your Nehalem machine would be to change hackbench so
> that each group (20 senders/ 20 receivers) run on a particular NUMA
> node.

I expect process scheduler to work well in scheduling different groups
to different nodes.

I suspected dynamic percpu data didn't take care of NUMA, but kernel dump shows
it does take care of NUMA.

Eric Dumazet

unread,

Apr 7, 2010, 5:30:02 AM4/7/10

to

Le mercredi 07 avril 2010 à 17:07 +0800, Zhang, Yanmin a écrit :
> >
> > One experiment on your Nehalem machine would be to change hackbench so
> > that each group (20 senders/ 20 receivers) run on a particular NUMA
> > node.
> I expect process scheduler to work well in scheduling different groups
> to different nodes.
>
> I suspected dynamic percpu data didn't take care of NUMA, but kernel dump shows
> it does take care of NUMA.
>

hackbench allocates all unix sockets on one single node, then
forks/spans its children.

Thats huge node imbalance.

You can see this with lsof on a running hackbench :

# lsof -p 14802
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
hackbench 14802 root cwd DIR 104,7 4096 12927240 /data/src/linux-2.6
hackbench 14802 root rtd DIR 104,2 4096 2 /
hackbench 14802 root txt REG 104,2 17524 697317 /usr/bin/hackbench
hackbench 14802 root mem REG 104,2 112212 558042 /lib/ld-2.3.4.so
hackbench 14802 root mem REG 104,2 1547588 558043 /lib/tls/libc-2.3.4.so
hackbench 14802 root mem REG 104,2 107928 557058 /lib/tls/libpthread-2.3.4.so
hackbench 14802 root mem REG 0,0 0 [heap] (stat: No such file or directory)
hackbench 14802 root 0u CHR 136,0 3 /dev/pts/0
hackbench 14802 root 1u CHR 136,0 3 /dev/pts/0
hackbench 14802 root 2u CHR 136,0 3 /dev/pts/0
hackbench 14802 root 3u unix 0xffff8800ac0da100 28939 socket
hackbench 14802 root 4u unix 0xffff8800ac0da400 28940 socket
hackbench 14802 root 5u unix 0xffff8800ac0da700 28941 socket
hackbench 14802 root 6u unix 0xffff8800ac0daa00 28942 socket
hackbench 14802 root 8u unix 0xffff8800aeac1800 28984 socket
hackbench 14802 root 9u unix 0xffff8800aeac1e00 28986 socket
hackbench 14802 root 10u unix 0xffff8800aeac2400 28988 socket
hackbench 14802 root 11u unix 0xffff8800aeac2a00 28990 socket
hackbench 14802 root 12u unix 0xffff8800aeac3000 28992 socket
hackbench 14802 root 13u unix 0xffff8800aeac3600 28994 socket
hackbench 14802 root 14u unix 0xffff8800aeac3c00 28996 socket
hackbench 14802 root 15u unix 0xffff8800aeac4200 28998 socket
hackbench 14802 root 16u unix 0xffff8800aeac4800 29000 socket
hackbench 14802 root 17u unix 0xffff8800aeac4e00 29002 socket
hackbench 14802 root 18u unix 0xffff8800aeac5400 29004 socket
hackbench 14802 root 19u unix 0xffff8800aeac5a00 29006 socket
hackbench 14802 root 20u unix 0xffff8800aeac6000 29008 socket
hackbench 14802 root 21u unix 0xffff8800aeac6600 29010 socket
hackbench 14802 root 22u unix 0xffff8800aeac6c00 29012 socket
hackbench 14802 root 23u unix 0xffff8800aeac7200 29014 socket
hackbench 14802 root 24u unix 0xffff8800aeac0f00 29016 socket
hackbench 14802 root 25u unix 0xffff8800aeac0900 29018 socket
hackbench 14802 root 26u unix 0xffff8800aeac7b00 29020 socket
hackbench 14802 root 27u unix 0xffff8800aeac7500 29022 socket

All sockets structures (where all _hot_ locks reside) are on a single node.

Pekka Enberg

unread,

Apr 7, 2010, 6:50:01 AM4/7/10

to

Zhang, Yanmin kirjoitti:

> Kernel 2.6.34-rc3:
> # Samples: 13079611308 LLC-load-misses
> #
> # Overhead Command Shared Object Symbol
> # ........ ............... .................................................................... ......
> #
> 18.55% hackbench [kernel.kallsyms] [k] copy_user_generic_str
> ing
> 13.19% hackbench [kernel.kallsyms] [k] unix_stream_recvmsg
> 11.62% hackbench [kernel.kallsyms] [k] kfree
> 8.54% hackbench [kernel.kallsyms] [k] kmem_cache_free
> 7.88% hackbench [kernel.kallsyms] [k] __kmalloc_node_track_
> caller
> 6.54% hackbench [kernel.kallsyms] [k] kmem_cache_alloc_node
> 5.94% hackbench [kernel.kallsyms] [k] kfree_skb
> 3.48% hackbench [kernel.kallsyms] [k] __slab_free
> 2.15% hackbench [kernel.kallsyms] [k] _raw_spin_lock
> 1.83% hackbench [kernel.kallsyms] [k] schedule
> 1.82% hackbench [kernel.kallsyms] [k] get_partial_node
> 1.59% hackbench hackbench [.] receiver
> 1.37% hackbench libpthread-2.9.so [.] __read

Btw, you might want to try out "perf record -g" and "perf report
--callchain fractal,5" to get a better view of where we're spending
time. Perhaps you can spot the difference with that more easily.

Christoph Lameter

unread,

Apr 7, 2010, 12:40:02 PM4/7/10

to

On Wed, 7 Apr 2010, Zhang, Yanmin wrote:

> > booting with slub_min_order=3 do change hackbench results for example ;)
> By default, slub_min_order=3 on my Nehalem machines. I also tried different
> larger slub_min_order and didn't find help.

Lets stop fiddling with kernel command line parameters for these test.
Leave as default. That is how I tested.

Christoph Lameter

unread,

Apr 7, 2010, 12:50:02 PM4/7/10

to

On Wed, 7 Apr 2010, Zhang, Yanmin wrote:

> I collected retired instruction, dtlb miss and LLC miss.
> Below is data of LLC miss.
>
> Kernel 2.6.33:

> 20.94% hackbench [kernel.kallsyms] [k] copy_user_generic_string
> 14.56% hackbench [kernel.kallsyms] [k] unix_stream_recvmsg
> 12.88% hackbench [kernel.kallsyms] [k] kfree
> 7.37% hackbench [kernel.kallsyms] [k] kmem_cache_free
> 7.18% hackbench [kernel.kallsyms] [k] kmem_cache_alloc_node
> 6.78% hackbench [kernel.kallsyms] [k] kfree_skb
> 6.27% hackbench [kernel.kallsyms] [k] __kmalloc_node_track_caller
> 2.73% hackbench [kernel.kallsyms] [k] __slab_free
> 2.21% hackbench [kernel.kallsyms] [k] get_partial_node
> 2.01% hackbench [kernel.kallsyms] [k] _raw_spin_lock
> 1.59% hackbench [kernel.kallsyms] [k] schedule
> 1.27% hackbench hackbench [.] receiver
> 0.99% hackbench libpthread-2.9.so [.] __read
> 0.87% hackbench [kernel.kallsyms] [k] unix_stream_sendmsg
>
> Kernel 2.6.34-rc3:

> 18.55% hackbench [kernel.kallsyms] [k] copy_user_generic_str
> ing
> 13.19% hackbench [kernel.kallsyms] [k] unix_stream_recvmsg
> 11.62% hackbench [kernel.kallsyms] [k] kfree
> 8.54% hackbench [kernel.kallsyms] [k] kmem_cache_free
> 7.88% hackbench [kernel.kallsyms] [k] __kmalloc_node_track_
> caller

Seems that the overhead of __kmalloc_node_track_caller was increased. The
function inlines slab_alloc().

> 6.54% hackbench [kernel.kallsyms] [k] kmem_cache_alloc_node
> 5.94% hackbench [kernel.kallsyms] [k] kfree_skb
> 3.48% hackbench [kernel.kallsyms] [k] __slab_free
> 2.15% hackbench [kernel.kallsyms] [k] _raw_spin_lock
> 1.83% hackbench [kernel.kallsyms] [k] schedule
> 1.82% hackbench [kernel.kallsyms] [k] get_partial_node
> 1.59% hackbench hackbench [.] receiver
> 1.37% hackbench libpthread-2.9.so [.] __read

I wonder if this is not related to the kmem_cache_cpu structure straggling
cache line boundaries under some conditions. On 2.6.33 the kmem_cache_cpu
structure was larger and therefore tight packing resulted in different
alignment.

Could you see how the following patch affects the results. It attempts to
increase the size of kmem_cache_cpu to a power of 2 bytes. There is also
the potential that other per cpu fetches to neighboring objects affect the
situation. We could cacheline align the whole thing.

---
include/linux/slub_def.h | 5 +++++
1 file changed, 5 insertions(+)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2010-04-07 11:33:50.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h 2010-04-07 11:35:18.000000000 -0500
@@ -38,6 +38,11 @@ struct kmem_cache_cpu {
void **freelist; /* Pointer to first free per cpu object */
struct page *page; /* The slab from which we are allocating */
int node; /* The node of the page (or -1 for debug) */
+#ifndef CONFIG_64BIT
+ int dummy1;
+#endif
+ unsigned long dummy2;
+
#ifdef CONFIG_SLUB_STATS
unsigned stat[NR_SLUB_STAT_ITEMS];
#endif

Pekka Enberg

unread,

Apr 7, 2010, 1:00:02 PM4/7/10

to

Pekka Enberg wrote:

> Would __cacheline_aligned_in_smp do the trick here?

Oh, sorry, I think it's actually '____cacheline_aligned_in_smp' (with
four underscores) for per-cpu data. Confusing...

Pekka Enberg

unread,

Apr 7, 2010, 1:00:03 PM4/7/10

to

Would __cacheline_aligned_in_smp do the trick here?

Christoph Lameter

unread,

Apr 7, 2010, 2:20:02 PM4/7/10

to

This is allocated via the percpu allocator. We could specify cacheline
alignment there but that would reduce the density. You basically need 4
words for a kmem_cache_cpu structure. A number of those fit into one 64
byte cacheline.

Pekka Enberg

unread,

Apr 7, 2010, 2:30:02 PM4/7/10

to

Christoph Lameter wrote:

> On Wed, 7 Apr 2010, Pekka Enberg wrote:
>
>> Oh, sorry, I think it's actually '____cacheline_aligned_in_smp' (with four
>> underscores) for per-cpu data. Confusing...
>

> This does not particulary help to clarify the situation since we are
> dealing with data that can either be allocated via the percpu allocator or
> be statically present (kmalloc bootstrap situation).

Yes, I am an idiot. :-)

Christoph Lameter

unread,

Apr 7, 2010, 2:30:02 PM4/7/10

to

On Wed, 7 Apr 2010, Pekka Enberg wrote:

> Oh, sorry, I think it's actually '____cacheline_aligned_in_smp' (with four
> underscores) for per-cpu data. Confusing...

This does not particulary help to clarify the situation since we are

dealing with data that can either be allocated via the percpu allocator or
be statically present (kmalloc bootstrap situation).

--

Eric Dumazet

unread,

Apr 7, 2010, 2:40:02 PM4/7/10

to

Le mercredi 07 avril 2010 à 13:20 -0500, Christoph Lameter a écrit :
> On Wed, 7 Apr 2010, Pekka Enberg wrote:
>
> > Oh, sorry, I think it's actually '____cacheline_aligned_in_smp' (with four
> > underscores) for per-cpu data. Confusing...
>
> This does not particulary help to clarify the situation since we are
> dealing with data that can either be allocated via the percpu allocator or
> be statically present (kmalloc bootstrap situation).
>
> --

Do we have a user program to check actual L1 cache size of a machine ?

I remember my HP blades have many BIOS options, I would like to make
sure they are properly set.

Christoph Lameter

unread,

Apr 7, 2010, 3:40:01 PM4/7/10

to

On Wed, 7 Apr 2010, Pekka Enberg wrote:

> Yes, I am an idiot. :-)

Plato said it in another way:

"As for me, all I know is that I know nothing."

Zhang, Yanmin

unread,

Apr 7, 2010, 9:10:01 PM4/7/10

to

On Wed, 2010-04-07 at 20:38 +0200, Eric Dumazet wrote:
> Le mercredi 07 avril 2010 à 13:20 -0500, Christoph Lameter a écrit :
> > On Wed, 7 Apr 2010, Pekka Enberg wrote:
> >
> > > Oh, sorry, I think it's actually '____cacheline_aligned_in_smp' (with four
> > > underscores) for per-cpu data. Confusing...
> >
> > This does not particulary help to clarify the situation since we are
> > dealing with data that can either be allocated via the percpu allocator or
> > be statically present (kmalloc bootstrap situation).
> >
> > --
>
> Do we have a user program to check actual L1 cache size of a machine ?

If there is no, it's easy to write it as kernel exports the cache stat by
/sys/devices/system/cpu/cpuXXX/cache/indexXXX/

Eric Dumazet

unread,

Apr 8, 2010, 1:00:02 AM4/8/10

to

Le jeudi 08 avril 2010 à 09:05 +0800, Zhang, Yanmin a écrit :

> > Do we have a user program to check actual L1 cache size of a machine ?
> If there is no, it's easy to write it as kernel exports the cache stat by
> /sys/devices/system/cpu/cpuXXX/cache/indexXXX/

Yes, this is what advertizes my L1 cache having 64bytes lines, but I
would like to check that in practice, this is not 128bytes...

./index0/type:Data
./index0/level:1
./index0/coherency_line_size:64
./index0/physical_line_partition:1
./index0/ways_of_associativity:8
./index0/number_of_sets:64
./index0/size:32K
./index0/shared_cpu_map:00000101
./index0/shared_cpu_list:0,8
./index1/type:Instruction
./index1/level:1
./index1/coherency_line_size:64
./index1/physical_line_partition:1
./index1/ways_of_associativity:4
./index1/number_of_sets:128
./index1/size:32K
./index1/shared_cpu_map:00000101
./index1/shared_cpu_list:0,8

Eric Dumazet

unread,

Apr 8, 2010, 1:40:02 AM4/8/10

to

I suspect NUMA is completely out of order on current kernel, or my
Nehalem machine NUMA support is a joke

# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 3071 MB
node 0 free: 2637 MB
node 1 size: 3062 MB
node 1 free: 2909 MB

# cat try.sh
hackbench 50 process 5000
numactl --cpubind=0 --membind=0 hackbench 25 process 5000 >RES0 &
numactl --cpubind=1 --membind=1 hackbench 25 process 5000 >RES1 &
wait
echo node0 results
cat RES0
echo node1 results
cat RES1

numactl --cpubind=0 --membind=1 hackbench 25 process 5000 >RES0_1 &
numactl --cpubind=1 --membind=0 hackbench 25 process 5000 >RES1_0 &
wait
echo node0 on mem1 results
cat RES0_1
echo node1 on mem0 results
cat RES1_0

# ./try.sh
Running with 50*40 (== 2000) tasks.
Time: 16.865
node0 results

Running with 25*40 (== 1000) tasks.

Time: 16.767
node1 results

Running with 25*40 (== 1000) tasks.

Time: 16.564
node0 on mem1 results

Running with 25*40 (== 1000) tasks.

Time: 16.814
node1 on mem0 results

Running with 25*40 (== 1000) tasks.

Time: 16.896

Eric Dumazet

unread,

Apr 8, 2010, 3:10:01 AM4/8/10

to

If run individually, the tests results are more what we would expect
(slow), but if machine runs the two set of process concurrently, each
group runs much faster...

# numactl --cpubind=0 --membind=1 hackbench 25 process 5000

Running with 25*40 (== 1000) tasks.

Time: 21.810

# numactl --cpubind=1 --membind=0 hackbench 25 process 5000

Running with 25*40 (== 1000) tasks.

Time: 20.679

# numactl --cpubind=0 --membind=1 hackbench 25 process 5000 >RES0_1 &
[1] 9177
# numactl --cpubind=1 --membind=0 hackbench 25 process 5000 >RES1_0 &
[2] 9196
# wait
[1]- Done numactl --cpubind=0 --membind=1 hackbench
25 process 5000 >RES0_1
[2]+ Done numactl --cpubind=1 --membind=0 hackbench
25 process 5000 >RES1_0
# echo node0 on mem1 results
node0 on mem1 results
# cat RES0_1

Running with 25*40 (== 1000) tasks.

Time: 13.818
# echo node1 on mem0 results
node1 on mem0 results
# cat RES1_0

Running with 25*40 (== 1000) tasks.

Time: 11.633

Oh well...

David Miller

unread,

Apr 8, 2010, 3:10:02 AM4/8/10

to

From: Eric Dumazet <eric.d...@gmail.com>
Date: Thu, 08 Apr 2010 09:00:19 +0200

> If run individually, the tests results are more what we would expect
> (slow), but if machine runs the two set of process concurrently, each
> group runs much faster...

BTW, I just discovered (thanks to the function graph tracer, woo hoo!)
that loopback TCP packets get fully checksum validated on receive.

I'm trying to figure out why skb->ip_summed ends up being
CHECKSUM_NONE in tcp_v4_rcv() even though it gets set to
CHECKSUM_PARTIAL in tcp_sendmsg().

I wonder how much this accounts for some of the hackbench
oddities... and other regressions in loopback tests we've seen.
:-)

Just FYI...

Zhang, Yanmin

unread,

Apr 8, 2010, 3:20:02 AM4/8/10

to

I tested the patch against 2.6.33+9dfc6e68bfe6e and it seems it doesn't help.

I dumped percpu allocation info when booting kernel and didn't find clear sign.

Eric Dumazet

unread,

Apr 8, 2010, 3:30:02 AM4/8/10

to

Le jeudi 08 avril 2010 à 00:05 -0700, David Miller a écrit :
> From: Eric Dumazet <eric.d...@gmail.com>
> Date: Thu, 08 Apr 2010 09:00:19 +0200
>
> > If run individually, the tests results are more what we would expect
> > (slow), but if machine runs the two set of process concurrently, each
> > group runs much faster...
>
> BTW, I just discovered (thanks to the function graph tracer, woo hoo!)
> that loopback TCP packets get fully checksum validated on receive.
>
> I'm trying to figure out why skb->ip_summed ends up being
> CHECKSUM_NONE in tcp_v4_rcv() even though it gets set to
> CHECKSUM_PARTIAL in tcp_sendmsg().
>
> I wonder how much this accounts for some of the hackbench
> oddities... and other regressions in loopback tests we've seen.
> :-)
>
> Just FYI...

Thanks !

But hackbench is a af_unix benchmark, so loopback stuff is not used that
much :)

David Miller

unread,

Apr 8, 2010, 3:30:02 AM4/8/10

to

From: David Miller <da...@davemloft.net>
Date: Thu, 08 Apr 2010 00:05:57 -0700 (PDT)

> From: Eric Dumazet <eric.d...@gmail.com>
> Date: Thu, 08 Apr 2010 09:00:19 +0200
>
>> If run individually, the tests results are more what we would expect
>> (slow), but if machine runs the two set of process concurrently, each
>> group runs much faster...
>
> BTW, I just discovered (thanks to the function graph tracer, woo hoo!)
> that loopback TCP packets get fully checksum validated on receive.
>
> I'm trying to figure out why skb->ip_summed ends up being
> CHECKSUM_NONE in tcp_v4_rcv() even though it gets set to
> CHECKSUM_PARTIAL in tcp_sendmsg().

Ok, it looks like it's only ACK packets that have this problem,
but still :-)

It's weird that we have a special ip_dev_loopback_xmit() for for
ip_mc_output() NF_HOOK()s, which forces skb->ip_summed to
CHECKSUM_UNNECESSARY, but the actual normal loopback xmit doesn't
do that...

Zhang, Yanmin

unread,

Apr 8, 2010, 4:00:03 AM4/8/10

to

If there are 2 nodes in the machine, processes on node 0 will contact MCH of
node 1 to access memory of node 1. I suspect the MCH of node 1 might enter
a power-saving mode when all the cpus of node 1 are free. So the transactions
from MCH 1 to MCH 0 has a larger latency.

Eric Dumazet

unread,

Apr 8, 2010, 4:00:02 AM4/8/10

to

Le jeudi 08 avril 2010 à 15:54 +0800, Zhang, Yanmin a écrit :

> If there are 2 nodes in the machine, processes on node 0 will contact MCH of
> node 1 to access memory of node 1. I suspect the MCH of node 1 might enter
> a power-saving mode when all the cpus of node 1 are free. So the transactions
> from MCH 1 to MCH 0 has a larger latency.
>

Hmm, thanks for the hint, I will investigate this.

Eric Dumazet

unread,

Apr 8, 2010, 4:20:02 AM4/8/10

to

Le jeudi 08 avril 2010 à 09:54 +0200, Eric Dumazet a écrit :
> Le jeudi 08 avril 2010 à 15:54 +0800, Zhang, Yanmin a écrit :
>
> > If there are 2 nodes in the machine, processes on node 0 will contact MCH of
> > node 1 to access memory of node 1. I suspect the MCH of node 1 might enter
> > a power-saving mode when all the cpus of node 1 are free. So the transactions
> > from MCH 1 to MCH 0 has a larger latency.
> >
>
> Hmm, thanks for the hint, I will investigate this.

Oh well,

perf timechart record &

Instant crash

Call Trace:
perf_trace_sched_switch+0xd5/0x120
schedule+0x6b5/0x860
retint_careful+0xd/0x21

RIP ffffffff81010955 perf_arch_fetch_caller_regs+0x15/0x40
CR2: 00000000d21f1422

Christoph Lameter

unread,

Apr 8, 2010, 11:40:02 AM4/8/10

to

On Thu, 8 Apr 2010, Eric Dumazet wrote:

> I suspect NUMA is completely out of order on current kernel, or my
> Nehalem machine NUMA support is a joke
>
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 size: 3071 MB
> node 0 free: 2637 MB
> node 1 size: 3062 MB
> node 1 free: 2909 MB

How do the cpus map to the nodes? cpu 0 and 1 both on the same node?

Eric Dumazet

unread,

Apr 8, 2010, 12:00:03 PM4/8/10

to

Le jeudi 08 avril 2010 à 10:34 -0500, Christoph Lameter a écrit :
> On Thu, 8 Apr 2010, Eric Dumazet wrote:
>
> > I suspect NUMA is completely out of order on current kernel, or my
> > Nehalem machine NUMA support is a joke
> >
> > # numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 size: 3071 MB
> > node 0 free: 2637 MB
> > node 1 size: 3062 MB
> > node 1 free: 2909 MB
>
> How do the cpus map to the nodes? cpu 0 and 1 both on the same node?

one socket maps to 0 2 4 6 8 10 12 14 (Node 0)
one socket maps to 1 3 5 7 9 11 13 15 (Node 1)

# numactl --cpubind=0 --membind=0 numactl --show
policy: bind
preferred node: 0
interleavemask:
interleavenode: 0
nodebind: 0
membind: 0
cpubind: 1 3 5 7 9 11 13 15 1024

(strange 1024 report...)

# numactl --cpubind=1 --membind=1 numactl --show
policy: bind
preferred node: 1
interleavemask:
interleavenode: 0
nodebind:
membind: 1
cpubind: 0 2 4 6 8 10 12 14

[ 0.161170] Booting Node 0, Processors #1
[ 0.248995] CPU 1 MCA banks CMCI:2 CMCI:3 CMCI:5 CMCI:6 SHD:8
[ 0.269177] Ok.
[ 0.269453] Booting Node 1, Processors #2
[ 0.356965] CPU 2 MCA banks CMCI:2 CMCI:3 CMCI:5 SHD:6 SHD:8
[ 0.377207] Ok.
[ 0.377485] Booting Node 0, Processors #3
[ 0.464935] CPU 3 MCA banks CMCI:2 CMCI:3 CMCI:5 SHD:6 SHD:8
[ 0.485065] Ok.
[ 0.485217] Booting Node 1, Processors #4
[ 0.572906] CPU 4 MCA banks CMCI:2 CMCI:3 CMCI:5 SHD:6 SHD:8
[ 0.593044] Ok.
...
grep "physical id" /proc/cpuinfo
physical id : 1
physical id : 0
physical id : 1
physical id : 0
physical id : 1
physical id : 0
physical id : 1
physical id : 0
physical id : 1
physical id : 0
physical id : 1
physical id : 0
physical id : 1
physical id : 0
physical id : 1
physical id : 0