Re: [numa shrinker] 9b17c62382: -36.6% regression on sparse file copy

Dave Chinner
Jan 8, 2014, 3:10:01 AM

On Mon, Jan 06, 2014 at 04:20:48PM +0800, fenggu...@intel.com wrote:
> Hi Dave,
>
> We noticed throughput drop in test case
>
> vm-scalability/300s-lru-file-readtwice (*)
>
> between v3.11 and v3.12, and it's still low as of v3.13-rc6:
>
>           v3.11                  v3.12                  v3.13-rc6
> ---------------  -------------------------  -------------------------
>   14934707 ~ 0%       -48.8%  7647311 ~ 0%       -47.6%  7829487 ~ 0%  vm-scalability.throughput
>              ^^       ^^^^^^
>         stddev%      change%

What does this vm-scalability.throughput number mean?

> (*) The test case basically does
>
> truncate -s 135080058880 /tmp/vm-scalability.img
> mkfs.xfs -q /tmp/vm-scalability.img
> mount -o loop /tmp/vm-scalability.img /tmp/vm-scalability
>
> nr_cpu=120
> for i in $(seq 1 $nr_cpu)
> do
>         sparse_file=/tmp/vm-scalability/sparse-lru-file-readtwice-$i
>         truncate $sparse_file -s 36650387592
>         dd if=$sparse_file of=/dev/null &
>         dd if=$sparse_file of=/dev/null &
> done

So a page cache load of reading 120x36GB files twice concurrently?
There's no increase in system time, so it can't be that the
shrinkers are running wild.
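
(A quick way to sanity-check that on the test box, offered as a suggestion
rather than something taken from the original report: watch the reclaim
counters while the dd processes run; if they stay flat, the shrinkers are
not doing much work.)

    # Suggested check, not part of the original test: sample the
    # direct-reclaim and slab-scan counters every 5 seconds.
    watch -n 5 "grep -E 'pgscan_direct|slabs_scanned' /proc/vmstat"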

FWIW, I'm at LCA right now, so it's going to be a week before I can
look at this, so if you can find any behavioural difference in the
shrinkers (e.g. from perf profiles, on different filesystems, etc)
I'd appreciate it...

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Fengguang Wu
Jan 8, 2014, 6:20:02 AM

On Tue, Jan 07, 2014 at 12:10:42AM +1100, Dave Chinner wrote:
> On Mon, Jan 06, 2014 at 04:20:48PM +0800, fenggu...@intel.com wrote:
> > Hi Dave,
> >
> > We noticed throughput drop in test case
> >
> > vm-scalability/300s-lru-file-readtwice (*)
> >
> > between v3.11 and v3.12, and it's still low as of v3.13-rc6:
> >
> >           v3.11                  v3.12                  v3.13-rc6
> > ---------------  -------------------------  -------------------------
> >   14934707 ~ 0%       -48.8%  7647311 ~ 0%       -47.6%  7829487 ~ 0%  vm-scalability.throughput
> >              ^^       ^^^^^^
> >         stddev%      change%
>
> What does this vm-scalability.throughput number mean?

It's the total throughput reported by all 240 dd processes:

8781176832 bytes (8.8 GB) copied, 299.97 s, 29.3 MB/s
2124931+0 records in
2124930+0 records out
8703713280 bytes (8.7 GB) copied, 299.97 s, 29.0 MB/s
2174078+0 records in
2174077+0 records out
...
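
As an aside, a minimal sketch of how the per-dd rates could be rolled up
into a single figure, assuming each dd's output had been saved to
hypothetical per-process logs under /tmp/vm-scalability/dd-*.log (an
assumption for illustration, not something shown in the script above):

    # Hypothetical post-processing: sum bytes/elapsed over every dd "copied" line.
    awk '/bytes .* copied/ { kb += $1 / $6 / 1024 }
         END { printf "aggregate throughput: %.0f KB/s\n", kb }' /tmp/vm-scalability/dd-*.log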

> > (*) The test case basically does
> >
> > truncate -s 135080058880 /tmp/vm-scalability.img
> > mkfs.xfs -q /tmp/vm-scalability.img
> > mount -o loop /tmp/vm-scalability.img /tmp/vm-scalability
> >
> > nr_cpu=120
> > for i in $(seq 1 $nr_cpu)
> > do
> >         sparse_file=/tmp/vm-scalability/sparse-lru-file-readtwice-$i
> >         truncate $sparse_file -s 36650387592
> >         dd if=$sparse_file of=/dev/null &
> >         dd if=$sparse_file of=/dev/null &
> > done
>
> So a page cache load of reading 120x36GB files twice concurrently?

Yes.

> There's no increase in system time, so it can't be that the
> shrinkers are running wild.
>
> FWIW, I'm at LCA right now, so it's going to be a week before I can
> look at this, so if you can find any behavioural difference in the
> shrinkers (e.g. from perf profiles, on different filesystems, etc)
> I'd appreciate it...

OK, enjoy your time there! I'll try different parameters and check whether
they make any difference.

Thanks,
Fengguang

Fengguang Wu
Jan 8, 2014, 10:00:02 PM

Hi Dave,

As you suggested, I added tests for ext4 and btrfs; the results are
the same.

Then I tried running perf record for 10 seconds, starting at the 200s mark
(the test runs for 410s in total). I see several warning messages and hope
they do not affect the accuracy too much:

[ 252.608069] perf samples too long (2532 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
[ 252.608863] perf samples too long (2507 > 2500), lowering kernel.perf_event_max_sample_rate to 25000
[ 252.609422] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.389 msecs
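
For completeness, this corresponds to roughly the following invocation (the
exact options are an assumption; only the 10-second window starting at 200s
is stated above):

    sleep 200 && perf record -a -g -o perf.data -- sleep 10   # system-wide, with call graphs
    perf report -i perf.data --stdio                          # produces the call graph further below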

Anyway, the noticeable perf changes are:

1d3d4437eae1bb2 9b17c62382dd2e7507984b989
--------------- -------------------------
12.15 ~10% +209.8% 37.63 ~ 2% brickland2/debug2/vm-scalability/300s-btrfs-lru-file-readtwice
12.88 ~16% +189.4% 37.27 ~ 0% brickland2/debug2/vm-scalability/300s-ext4-lru-file-readtwice
15.24 ~ 9% +146.0% 37.50 ~ 1% brickland2/debug2/vm-scalability/300s-xfs-lru-file-readtwice
40.27 +179.1% 112.40 TOTAL perf-profile.cpu-cycles._raw_spin_lock.grab_super_passive.super_cache_count.shrink_slab.do_try_to_free_pages

1d3d4437eae1bb2 9b17c62382dd2e7507984b989
--------------- -------------------------
11.91 ~12% +218.2% 37.89 ~ 2% brickland2/debug2/vm-scalability/300s-btrfs-lru-file-readtwice
12.47 ~16% +200.3% 37.44 ~ 0% brickland2/debug2/vm-scalability/300s-ext4-lru-file-readtwice
15.36 ~11% +145.4% 37.68 ~ 1% brickland2/debug2/vm-scalability/300s-xfs-lru-file-readtwice
39.73 +184.5% 113.01 TOTAL perf-profile.cpu-cycles._raw_spin_lock.put_super.drop_super.super_cache_count.shrink_slab

perf report for 9b17c62382dd2e7507984b989:

# Overhead Command Shared Object Symbol
# ........ ............... .................. ..............................................
#
77.74% dd [kernel.kallsyms] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|--47.65%-- grab_super_passive
| super_cache_count
| shrink_slab
| do_try_to_free_pages
| try_to_free_pages
| __alloc_pages_nodemask
| alloc_pages_current
| __page_cache_alloc
| __do_page_cache_readahead
| ra_submit
| ondemand_readahead
| |
| |--92.13%-- page_cache_async_readahead
| | generic_file_aio_read
| | xfs_file_aio_read
| | do_sync_read
| | vfs_read
| | SyS_read
| | system_call_fastpath
| | read
| |
| --7.87%-- page_cache_sync_readahead
| generic_file_aio_read
| xfs_file_aio_read
| do_sync_read
| vfs_read
| SyS_read
| system_call_fastpath
| read
|--47.48%-- put_super
| drop_super
| super_cache_count
| shrink_slab
| do_try_to_free_pages
| try_to_free_pages
| __alloc_pages_nodemask
| alloc_pages_current
| __page_cache_alloc
| __do_page_cache_readahead
| ra_submit
| ondemand_readahead
| |
| |--92.04%-- page_cache_async_readahead
| | generic_file_aio_read
| | xfs_file_aio_read
| | do_sync_read
| | vfs_read
| | SyS_read
| | system_call_fastpath
| | read
| |
| --7.96%-- page_cache_sync_readahead
| generic_file_aio_read
| xfs_file_aio_read
| do_sync_read
| vfs_read
| SyS_read
| system_call_fastpath
| read
--4.87%-- [...]

The full changeset is attached.

Thanks,
Fengguang
[attachment: 9b17c62382dd2e7507984b9890bf44e070cdd8bb]

Dave Chinner
Jan 14, 2014, 7:20:01 PM

Oh, it's superblock lock contention, probably caused by an increase
in shrinker calls (i.e. per-node rather than global). I think we've
seen this before - can you try the two patches from Tim Chen here:

https://lkml.org/lkml/2013/9/6/353
https://lkml.org/lkml/2013/9/6/356

If they fix the problem, I'll get them into 3.14 and pushed back to
the relevant stable kernels.
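
(If it saves a step: assuming the two patches have been saved locally as
tim-chen-1.patch and tim-chen-2.patch, hypothetical filenames, they can be
applied and tested on top of the offending commit with:)

    # Apply on top of the commit that introduced the regression, then rebuild.
    git checkout 9b17c62382dd2e7507984b9890bf44e070cdd8bb
    git am tim-chen-1.patch tim-chen-2.patch
    make -j"$(nproc)" && sudo make modules_install install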

Fengguang Wu
Jan 27, 2014, 7:20:02 AM

Hi Dave,

Yes, the two patches help a lot:

9b17c62382dd2e7 8401edd4b12960c703233f4ed
--------------- -------------------------
6748913 ~ 2% +37.5% 9281049 ~ 1% brickland2/debug2/vm-scalability/300s-btrfs-lru-file-readtwice
8417200 ~ 0% +56.5% 13172417 ~ 0% brickland2/debug2/vm-scalability/300s-ext4-lru-file-readtwice
8333983 ~ 1% +56.9% 13078610 ~ 0% brickland2/debug2/vm-scalability/300s-xfs-lru-file-readtwice
23500096 ~ 1% +51.2% 35532077 ~ 0% TOTAL vm-scalability.throughput

They restore the performance numbers to the level of 1d3d4437eae1bb2
(which is 9b17c62382's parent commit).

Thanks,
Fengguang

Tim Chen
Jan 28, 2014, 2:10:01 PM

Dave,

Are you going to merge the two patches into 3.14?

Thanks.

Tim