And I have just realized that I forgot about the daemon stack:
# cat /proc/573/stack
[<c019c981>] shrink_zone+0x1b9/0x455
[<c019d462>] do_try_to_free_pages+0x9d/0x301
[<c019d803>] try_to_free_pages+0xb3/0x104
[<c01966d7>] __alloc_pages_nodemask+0x358/0x589
[<c01bf314>] khugepaged+0x13f/0xc60
[<c014c301>] kthread+0x67/0x6c
[<c0102db6>] kernel_thread_helper+0x6/0x10
[<ffffffff>] 0xffffffff
And the issue is gone. Just as it came - without any obvious trigger.
$ cat /proc/573/stack
[<c01bfc8a>] khugepaged+0xab5/0xc60
[<c014c301>] kthread+0x67/0x6c
[<c0102db6>] kernel_thread_helper+0x6/0x10
So it looks like it managed to free some pages. Still, it took a while
(I would say an hour), so something seems fishy.
Just for reference, I am also attaching the allocator state:
====
# cat /proc/buddyinfo
Node 0, zone      DMA     8     3     2     5     6     5     3     1     1     1     1
Node 0, zone   Normal  2670  3000  2194  1489  1069   690   363   180    84    43     2
Node 0, zone  HighMem   657  3474  1338   873   254    12     0     0     0     0     0
====
# cat /proc/vmstat
nr_free_pages 187236
nr_inactive_anon 27928
nr_active_anon 94882
nr_inactive_file 78838
nr_active_file 97601
nr_unevictable 0
nr_mlock 0
nr_anon_pages 71401
nr_mapped 20701
nr_file_pages 219722
nr_dirty 8
nr_writeback 0
nr_slab_reclaimable 4269
nr_slab_unreclaimable 4188
nr_page_table_pages 827
nr_kernel_stack 235
nr_unstable 0
nr_bounce 0
nr_vmscan_write 23503
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 39642
nr_dirtied 1735727
nr_written 1607464
nr_anon_transparent_hugepages 9
nr_dirty_threshold 77463
nr_dirty_background_threshold 19365
pgpgin 19253206
pgpgout 6972134
pswpin 4233
pswpout 23401
pgalloc_dma 11688
pgalloc_normal 153681036
pgalloc_high 45511627
pgalloc_movable 0
pgfree 199845128
pgactivate 1831896
pgdeactivate 318554
pgfault 87302686
pgmajfault 15523
pgrefill_dma 288
pgrefill_normal 93009
pgrefill_high 200394
pgrefill_movable 0
pgsteal_dma 0
pgsteal_normal 3949660
pgsteal_high 601671
pgsteal_movable 0
pgscan_kswapd_dma 0
pgscan_kswapd_normal 3678094
pgscan_kswapd_high 366447
pgscan_kswapd_movable 0
pgscan_direct_dma 0
pgscan_direct_normal 290984
pgscan_direct_high 303477
pgscan_direct_movable 0
pginodesteal 73185
slabs_scanned 353792
kswapd_steal 4026528
kswapd_inodesteal 173760
kswapd_low_wmark_hit_quickly 6
kswapd_high_wmark_hit_quickly 7758
kswapd_skip_congestion_wait 0
pageoutrun 79411
allocstall 310
pgrotated 22447
compact_blocks_moved 11205
compact_pages_moved 325766
compact_pagemigrate_failed 6165
compact_stall 347
compact_fail 67
compact_success 280
htlb_buddy_alloc_success 0
htlb_buddy_alloc_fail 0
unevictable_pgs_culled 1093
unevictable_pgs_scanned 0
unevictable_pgs_rescued 359
unevictable_pgs_mlocked 1307
unevictable_pgs_munlocked 1306
unevictable_pgs_cleared 0
unevictable_pgs_stranded 0
unevictable_pgs_mlockfreed 0
====
# grep . -r /sys/kernel/mm/transparent_hugepage/
/sys/kernel/mm/transparent_hugepage/enabled:[always] madvise never
/sys/kernel/mm/transparent_hugepage/defrag:always madvise [never]
/sys/kernel/mm/transparent_hugepage/khugepaged/defrag:yes [no]
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none:1023
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan:8192
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed:1524
/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans:1510
/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs:10000
/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs:60000
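(A note on the dump above: defrag reads [never] because I had already
tried switching it off while debugging; that was done via the usual
sysfs knob, i.e. something like:
# echo never > /sys/kernel/mm/transparent_hugepage/defrag
)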
On Mon, Feb 07, 2011 at 10:16:01PM +0100, Michal Hocko wrote:
> On Mon 07-02-11 22:06:54, Michal Hocko wrote:
> > Hi Andrea,
> >
> > I am currently running into an issue where khugepaged runs at 100% on
> > one of my CPUs for a long time (at least an hour as I am writing this
> > email). The kernel is a clean vanilla 2.6.38-rc3 (i386) kernel.
> >
> > I have tried to disable defrag but it didn't help (I haven't rebooted
> > after setting the value). I am not sure what information is helpful,
> > and I am also not sure whether I will be able to reproduce this after
> > a restart (it is the first time I have seen this problem), so sorry
> > for the poor report.
> >
> > Here is some basic info which might be useful (config and sysrq+t are
> > attached):
> > =========
>
> And I have just realized that I forgot about the daemon stack:
> # cat /proc/573/stack
> [<c019c981>] shrink_zone+0x1b9/0x455
> [<c019d462>] do_try_to_free_pages+0x9d/0x301
> [<c019d803>] try_to_free_pages+0xb3/0x104
> [<c01966d7>] __alloc_pages_nodemask+0x358/0x589
> [<c01bf314>] khugepaged+0x13f/0xc60
> [<c014c301>] kthread+0x67/0x6c
> [<c0102db6>] kernel_thread_helper+0x6/0x10
> [<ffffffff>] 0xffffffff
It would be great to know whether __alloc_pages_nodemask returned or
whether khugepaged was calling it in a loop.

When __alloc_pages_nodemask fails in collapse_huge_page, hpage is set
to ERR_PTR(-ENOMEM) and khugepaged_scan_pmd returns 1; then
khugepaged_scan_mm_slot goes to breakouterloop_mmap_sem and returns
progress. The khugepaged_do_scan main loop should then notice that
IS_ERR(*hpage) is set, break out of the loop, and return. Finally,
khugepaged_loop should notice that IS_ERR(hpage) is set and throttle
for alloc_sleep_millisecs inside khugepaged_alloc_sleep before setting
hpage to NULL and trying the allocation again. I wonder what could be
going wrong in khugepaged, or whether it is a bug inside
__alloc_pages_nodemask rather than a khugepaged issue.
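To illustrate, here is a compile-and-run userspace sketch of that
expected path. It is only a model: the function names mirror the
2.6.38 kernel, but the bodies are simplified stand-ins, and
alloc_sleep_millisecs is shortened so the demo finishes quickly:

#include <stdio.h>
#include <errno.h>
#include <time.h>

#define MAX_ERRNO   4095
#define ERR_PTR(e)  ((void *)(long)(e))
#define IS_ERR(p)   ((unsigned long)(p) >= (unsigned long)-MAX_ERRNO)

/* kernel default is 60000; shortened so the demo finishes quickly */
static unsigned int alloc_sleep_millisecs = 100;

/* Stand-in for the hugepage allocation that ends up in
 * __alloc_pages_nodemask(); pretend it fails persistently. */
static void *alloc_hugepage(void)
{
	return NULL;
}

static void khugepaged_alloc_sleep(void)
{
	struct timespec ts = {
		.tv_sec  = alloc_sleep_millisecs / 1000,
		.tv_nsec = (long)(alloc_sleep_millisecs % 1000) * 1000000L,
	};
	nanosleep(&ts, NULL);
}

/* Models khugepaged_do_scan(): on allocation failure, record the
 * error in *hpage and return so the outer loop can throttle. */
static void khugepaged_do_scan(void **hpage)
{
	if (!*hpage) {
		*hpage = alloc_hugepage();
		if (!*hpage) {
			*hpage = ERR_PTR(-ENOMEM);
			return;
		}
	}
	/* ... scan mm slots and try to collapse pages here ... */
}

int main(void)
{
	void *hpage = NULL;
	int i;

	/* Models khugepaged_loop(): the IS_ERR() check is what should
	 * keep a persistent allocation failure from busy-looping. */
	for (i = 0; i < 3; i++) {
		khugepaged_do_scan(&hpage);
		if (IS_ERR(hpage)) {
			printf("alloc failed, throttling %u ms\n",
			       alloc_sleep_millisecs);
			khugepaged_alloc_sleep();
			hpage = NULL;
		}
	}
	return 0;
}

Of course, if __alloc_pages_nodemask never returns in the first place
(and your stack trace, sitting inside try_to_free_pages, hints at
exactly that), khugepaged never reaches the IS_ERR check at all, which
is why it matters whether the allocator returned.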
It would be best if you ran sysrq+l several times (/proc/*/stack does
not seem ideal for sampling running tasks, even though it should
already be accurate enough; if you sample it often and combine it with
sysrq+l, it will be much clearer what is actually running).
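(For reference: assuming CONFIG_MAGIC_SYSRQ is enabled, the same
backtrace dump can be triggered without the keyboard, and the
backtraces of all active CPUs then show up in the kernel log:
# echo l > /proc/sysrq-trigger
# dmesg | tail
)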
I hope you can reproduce it; if it is an allocator issue you should
hit it again by keeping the same workload on that same system. I doubt
I can reproduce it here at the moment, as I don't know enough about
your load to simulate it.
Thanks a lot,
Andrea
OK, I will try that if I see it again.
>
> I hope you can reproduce it; if it is an allocator issue you should
> hit it again by keeping the same workload on that same system. I doubt
> I can reproduce it here at the moment, as I don't know enough about
> your load to simulate it.
My workload is rather "normal", I would say: Firefox with a couple of
tabs, skype, mutt, xine (watching streaming television), kernel builds,
and repeated suspend/wakeup cycles (I reboot only when installing a new
kernel, i.e. when a new rc is released or I need to test something). So
it is hard to find out what triggered this situation.
I will let you know if I get into the same situation again and will
provide the sysrq+l output.
Thanks
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic