Deadlocks with transparent huge pages and userspace fs daemons

Dave Hansen

Nov 3, 2010, 4:43:42 PM
to Miklos Szeredi, Andrea Arcangeli, linux-fsdevel, linux-mm, linux-...@vger.kernel.org, Lin Feng Shen, Yuri L Volobuev, Mel Gorman, di...@cn.ibm.com, lnxninja
Hey Miklos,

When testing with a transparent huge page kernel:

http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/andrea/aa.git;a=summary

some IBM testers ran into some deadlocks. It appears that the
khugepaged process is trying to migrate one of a filesystem daemon's
pages while khugepaged holds the daemon's mmap_sem for write.

I think I've reproduced this issue in a slightly different form with
FUSE. In my case, I think the FUSE process actually deadlocks on itself
instead of with khugepaged as in the IBM tester example that got me
looking at this.

Andrea put it this way:
> As long as page faults are needed to execute the I/O I doubt it's safe. But
> I'll definitely change khugepaged not to allocate memory. If nothing else
> because I don't want khugepaged to make easier to trigger issues like this. But
> it's hard for me to consider this a bug of khugepaged from a theoretical
> standpoint.

I tend to agree. khugepaged makes the likelihood of these things
happening much higher, but I don't think it fundamentally creates the
issue.

Should we do something like make page compaction always non-blocking on
lock_page()? Should we teach the VM about fuse daemons somehow?

INFO: task unionfs:3527 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
unionfs D ffff88007d356ec0 0 3527 3478 0x00000000
ffff88007b0db9a8 0000000000000082 ffffea00000650c8 ffff88007d356c70
ffff88007d1286a0 000000000000000d 0000000000000000 0000000000000301
ffff88007b0db978 ffffffff81098f70 ffff88007b0dba58 ffff880001db1f40
Call Trace:
[<ffffffff81098f70>] ? vma_prio_tree_next+0x3c/0x52
[<ffffffff813eb183>] io_schedule+0x38/0x4d
[<ffffffff8108683a>] sync_page+0x44/0x48
[<ffffffff813eb5e7>] __wait_on_bit_lock+0x42/0x8a
[<ffffffff810867f6>] ? sync_page+0x0/0x48
[<ffffffff810867e2>] __lock_page+0x64/0x6b
[<ffffffff810467bb>] ? wake_bit_function+0x0/0x2a
[<ffffffff810bce62>] migrate_pages+0x1df/0x66b
[<ffffffff810b8b33>] ? compaction_alloc+0x0/0x2b9
[<ffffffff8108fa2c>] ? ____pagevec_lru_add+0x13c/0x14f
[<ffffffff810b85e5>] compact_zone+0x331/0x54d
[<ffffffff810b89e4>] compact_zone_order+0xaa/0xb9
[<ffffffff810b8acd>] try_to_compact_pages+0xda/0x140
[<ffffffff8108c3f0>] __alloc_pages_nodemask+0x3a6/0x74b
[<ffffffff810b5db5>] alloc_pages_vma+0x110/0x13d
[<ffffffff810c6d6d>] do_huge_pmd_anonymous_page+0xc0/0x287
[<ffffffff810a0ed7>] handle_mm_fault+0x15c/0x201
[<ffffffff813efa5c>] do_page_fault+0x304/0x422
[<ffffffff810a5e8a>] ? do_brk+0x282/0x2c8
[<ffffffff813ed40f>] page_fault+0x1f/0x30

I had to make some changes to the transparent huge page code to get this
to happen. First, I made the scanning *REALLY* aggressive:

echo 1 > /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
echo 1 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
echo 65536 > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
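
Since these settings are far from the defaults, it can help to record the old values before overwriting them. The following is only a sketch: the helper name `apply_tunables` and the `base` parameter are made up for illustration, and writing to the real sysfs directory needs root on a kernel with transparent huge pages.

```python
from pathlib import Path

# Directory and file names are the ones used in the echo commands above.
KHUGEPAGED = Path("/sys/kernel/mm/transparent_hugepage/khugepaged")
AGGRESSIVE = {
    "alloc_sleep_millisecs": "1",
    "scan_sleep_millisecs": "1",
    "pages_to_scan": "65536",
}


def apply_tunables(values, base=KHUGEPAGED):
    """Write each tunable and return the previous values for later restore.

    Hypothetical helper; `base` exists so the function can be pointed at a
    scratch directory instead of the real sysfs tree.
    """
    previous = {}
    for name, value in values.items():
        path = Path(base) / name
        previous[name] = path.read_text().strip()  # remember the old setting
        path.write_text(value)                     # then overwrite it
    return previous
```

Calling `old = apply_tunables(AGGRESSIVE)` before the test and `apply_tunables(old)` afterwards would put the scanner back to its earlier pace.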

Then, I hacked the migrate_pages() call to unmap_and_move() to always
'force', so that it tries to lock_page() unconditionally. That's just
to make this race more common. I also created some large malloc()'d
memory areas in the unionfs daemon and touched them constantly to cause
lots of page faults.

Other relevant tasks:

INFO: task mmap-and-touch:3584 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mmap-and-touc D ffff88007bd71510 0 3584 3542 0x00000000
ffff88007a591b88 0000000000000086 ffff88007bd57400 ffff88007bd712c0
ffff88007d01cd70 ffffffff00000004 ffff88007d22e578 ffff88005e5b7440
ffff88007a591b58 0000000181182a8c ffff88007a591b88 ffff880001c91f40
Call Trace:
[<ffffffff813eb183>] io_schedule+0x38/0x4d
[<ffffffff8108683a>] sync_page+0x44/0x48
[<ffffffff813eb5e7>] __wait_on_bit_lock+0x42/0x8a
[<ffffffff810867f6>] ? sync_page+0x0/0x48
[<ffffffff810867e2>] __lock_page+0x64/0x6b
[<ffffffff810467bb>] ? wake_bit_function+0x0/0x2a
[<ffffffff810868a1>] find_lock_page+0x39/0x5d
[<ffffffff81087f60>] filemap_fault+0x1a6/0x30e
[<ffffffff8109e5e0>] __do_fault+0x50/0x432
[<ffffffff8109f636>] handle_pte_fault+0x2db/0x717
[<ffffffff8108b67c>] ? __free_pages+0x1b/0x24
[<ffffffff810a0d6c>] ? __pte_alloc+0x112/0x121
[<ffffffff810a0f64>] handle_mm_fault+0x1e9/0x201
[<ffffffff813efa5c>] do_page_fault+0x304/0x422
[<ffffffff810cc83d>] ? sys_newfstat+0x29/0x34
[<ffffffff813ed40f>] page_fault+0x1f/0x30
INFO: task memknobs:3599 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
memknobs D ffff88007d305b20 0 3599 3573 0x00000000
ffff88005e4539a8 0000000000000086 ffff88005e453978 ffff88007d3058d0
ffff88007dbb60d0 ffffea0000000002 000000003963d000 ffff88007a4c11e8
ffffea000033aa10 000000017b1e69e0 ffff88005e453988 ffff880001c51f40
Call Trace:
[<ffffffff813eb183>] io_schedule+0x38/0x4d
[<ffffffff8108683a>] sync_page+0x44/0x48
[<ffffffff813eb5e7>] __wait_on_bit_lock+0x42/0x8a
[<ffffffff810867f6>] ? sync_page+0x0/0x48
[<ffffffff810867e2>] __lock_page+0x64/0x6b
[<ffffffff810467bb>] ? wake_bit_function+0x0/0x2a
[<ffffffff810bce62>] migrate_pages+0x1df/0x66b
[<ffffffff810b8b33>] ? compaction_alloc+0x0/0x2b9
[<ffffffff8108fa2c>] ? ____pagevec_lru_add+0x13c/0x14f
[<ffffffff810b85e5>] compact_zone+0x331/0x54d
[<ffffffff810b89e4>] compact_zone_order+0xaa/0xb9
[<ffffffff810b8acd>] try_to_compact_pages+0xda/0x140
[<ffffffff8108c3f0>] __alloc_pages_nodemask+0x3a6/0x74b
[<ffffffff810b5db5>] alloc_pages_vma+0x110/0x13d
[<ffffffff810c6d6d>] do_huge_pmd_anonymous_page+0xc0/0x287
[<ffffffff810a0ed7>] handle_mm_fault+0x15c/0x201
[<ffffffff813efa5c>] do_page_fault+0x304/0x422
[<ffffffff81020e5e>] ? __dequeue_entity+0x2e/0x33
[<ffffffff81000e25>] ? __switch_to+0x22a/0x23c
[<ffffffff81020e7b>] ? set_next_entity+0x18/0x36
[<ffffffff81022e83>] ? finish_task_switch+0x3c/0x81
[<ffffffff813eb0a5>] ? schedule+0x6f4/0x79a
[<ffffffff813ed40f>] page_fault+0x1f/0x30
INFO: task khugepaged:515 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
khugepaged D ffff88007d1e8360 0 515 2 0x00000000
ffff88007cad5d00 0000000000000046 ffff88007cad5cc0 ffff88007d1e8110
ffff88007d0986e0 0000000000000008 ffff88007cad5ce0 ffffffff81037e33
00000000ffffffff 000000017cad5d50 00000001000dd090 0000000000000002
Call Trace:
[<ffffffff81037e33>] ? lock_timer_base+0x26/0x4a
[<ffffffff813ec8af>] rwsem_down_failed_common+0xcc/0xfe
[<ffffffff813ec8f4>] rwsem_down_write_failed+0x13/0x15
[<ffffffff811ccef3>] call_rwsem_down_write_failed+0x13/0x20
[<ffffffff813ec09b>] ? down_write+0x20/0x22
[<ffffffff810c6174>] khugepaged+0xee0/0xf5f
[<ffffffff81046783>] ? autoremove_wake_function+0x0/0x38
[<ffffffff810c5294>] ? khugepaged+0x0/0xf5f
[<ffffffff810462ce>] kthread+0x81/0x89
[<ffffffff81002cf4>] kernel_thread_helper+0x4/0x10
[<ffffffff8104624d>] ? kthread+0x0/0x89
[<ffffffff81002cf0>] ? kernel_thread_helper+0x0/0x10


Original stack trace from GPFS deadlock:

> khugepaged D ffff88007c823080 0 52 2 0x00000000
> ffff8800378c98f0 0000000000000046 0000000000000000 001a7949f3208ca4
> ffffffffffffff10 ffff880079efc670 000000002b6c79c0 00000001169be651
> ffff88003780c638 ffff8800378c9fd8 0000000000010518 ffff88003780c638
> Call Trace:
> [<ffffffff8110c060>] ? sync_page+0x0/0x50
> [<ffffffff814c8a23>] io_schedule+0x73/0xc0
> [<ffffffff8110c09d>] sync_page+0x3d/0x50
> [<ffffffff814c914a>] __wait_on_bit_lock+0x5a/0xc0
> [<ffffffff8110c037>] __lock_page+0x67/0x70
> [<ffffffff81091ce0>] ? wake_bit_function+0x0/0x50
> [<ffffffff81122461>] ? lru_cache_add_lru+0x21/0x40
> [<ffffffff8115b730>] lock_page+0x30/0x40
> [<ffffffff8115bdad>] migrate_pages+0x59d/0x5d0
> [<ffffffff81152470>] ? compaction_alloc+0x0/0x370
> [<ffffffff81151f1c>] compact_zone+0x4ac/0x5e0
> [<ffffffff8111cd1c>] ? get_page_from_freelist+0x15c/0x820
> [<ffffffff811522ce>] compact_zone_order+0x7e/0xb0
> [<ffffffff81152409>] try_to_compact_pages+0x109/0x170
> [<ffffffff8111e62c>] __alloc_pages_nodemask+0x55c/0x810
> [<ffffffff81150374>] alloc_pages_vma+0x84/0x110
> [<ffffffff8116530f>] khugepaged+0xa4f/0x1190
> [<ffffffff81091ca0>] ? autoremove_wake_function+0x0/0x40
> [<ffffffff811648c0>] ? khugepaged+0x0/0x1190
> [<ffffffff81091936>] kthread+0x96/0xa0
> [<ffffffff810141ca>] child_rip+0xa/0x20
> [<ffffffff810918a0>] ? kthread+0x0/0xa0
> [<ffffffff810141c0>] ? child_rip+0x0/0x20
>
>
> mmfsd D ffff88007c823680 0 4453 4118 0x00000080
> ffff88001ad1ddf0 0000000000000082 0000000000000000 0000000000000000
> 0000000000000000 ffff880037fcee40 ffff880079d40ab0 00000001169be9c1
> ffff8800782b7ad8 ffff88001ad1dfd8 0000000000010518 ffff8800782b7ad8
> Call Trace:
> [<ffffffff814c8286>] ? thread_return+0x4e/0x778
> [<ffffffff81095da3>] ? __hrtimer_start_range_ns+0x1a3/0x430
> [<ffffffff814ca6b5>] rwsem_down_failed_common+0x95/0x1d0
> [<ffffffff814ca846>] rwsem_down_read_failed+0x26/0x30
> [<ffffffff81264224>] call_rwsem_down_read_failed+0x14/0x30
> [<ffffffff814c9d44>] ? down_read+0x24/0x30
> [<ffffffff814cd6fa>] do_page_fault+0x34a/0x3a0
> [<ffffffff814caf45>] page_fault+0x25/0x30


-- Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Miklos Szeredi

Nov 3, 2010, 5:46:47 PM
to Dave Hansen, mik...@szeredi.hu, aarc...@redhat.com, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, shen...@cn.ibm.com, volo...@us.ibm.com, m...@linux.vnet.ibm.com, di...@cn.ibm.com, lnxn...@us.ibm.com
On Wed, 03 Nov 2010, Dave Hansen wrote:
> Hey Miklos,
>
> When testing with a transparent huge page kernel:
>
> http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/andrea/aa.git;a=summary
>
> some IBM testers ran into some deadlocks. It appears that the
> khugepaged process is trying to migrate one of a filesystem daemon's
> pages while khugepaged holds the daemon's mmap_sem for write.
>
> I think I've reproduced this issue in a slightly different form with
> FUSE. In my case, I think the FUSE process actually deadlocks on itself
> instead of with khugepaged as in the IBM tester example that got me
> looking at this.
>
> Andrea put it this way:
> > As long as page faults are needed to execute the I/O I doubt it's safe. But
> > I'll definitely change khugepaged not to allocate memory. If nothing else
> > because I don't want khugepaged to make easier to trigger issues like this. But
> > it's hard for me to consider this a bug of khugepaged from a theoretical
> > standpoint.
>
> I tend to agree. khugepaged makes the likelihood of these things
> happening much higher, but I don't think it fundamentally creates the
> issue.

Yes, I agree too.

I think what is happening is that the fuse daemon is trying to read a
page. While that is happening the page is locked. If the daemon
blocks on a lock_page() for that same page, that is an obvious
deadlock.
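
The cycle can be modelled in userspace as a rough sketch. Everything here is a stand-in: a `threading.Lock` plays the page lock, the function name `serve_read` is made up, and a timeout replaces the real lock_page(), which would wait forever.

```python
import threading


def serve_read(page_lock, blocking_migration, timeout=0.2):
    """Toy model: the daemon services a read while the page stays locked."""
    # Kernel side: the page is locked for as long as the read is in flight.
    page_lock.acquire()

    # Daemon side: a page fault in the daemon enters compaction, which
    # picks that same page as a migration candidate.
    if blocking_migration:
        # Stand-in for a blocking lock_page(); the timeout only exists so
        # that this model terminates.
        if not page_lock.acquire(timeout=timeout):
            return "deadlock"  # the only task able to unlock is waiting
    else:
        # Stand-in for a trylock: on failure the page is simply skipped.
        if page_lock.acquire(blocking=False):
            page_lock.release()

    # The daemon finishes the read; the kernel then unlocks the page.
    page_lock.release()
    return "read completed"
```

With `blocking_migration=True` the model reports the deadlock, because the lock holder is waiting for a reply that the blocked daemon can never send. With a trylock the page is skipped and the read completes.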

This is not unique to fuse: for example, if NFS or any other network
filesystem is used over userspace tunneling (e.g. openvpn), then the
same thing can happen.

> Should we do something like make page compaction always non-blocking on
> lock_page()?

Yes, at least on !PageUptodate() pages.

Also blocking on page writeback has a similar effect. Fuse is immune
to that because it does writeback in a special way. But the network
fs over userspace tunneling case is not immune AFAICS.

Thanks,
Miklos

Andrea Arcangeli

Nov 4, 2010, 12:42:17 PM
to Dave Hansen, Miklos Szeredi, linux-fsdevel, linux-mm, linux-...@vger.kernel.org, Lin Feng Shen, Yuri L Volobuev, Mel Gorman, di...@cn.ibm.com, lnxninja
On Wed, Nov 03, 2010 at 01:43:25PM -0700, Dave Hansen wrote:
> Hey Miklos,
>
> When testing with a transparent huge page kernel:
>
> http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/andrea/aa.git;a=summary
>
> some IBM testers ran into some deadlocks. It appears that the
> khugepaged process is trying to migrate one of a filesystem daemon's
> pages while khugepaged holds the daemon's mmap_sem for write.

Correct. So now I'm wondering what happens if some library of this
daemon happens to execute a munmap that calls split_vma and allocates
memory while holding the mmap_sem, and the memory allocation triggers
I/O that will have to be executed by the daemon.

> I think I've reproduced this issue in a slightly different form with
> FUSE. In my case, I think the FUSE process actually deadlocks on itself
> instead of with khugepaged as in the IBM tester example that got me
> looking at this.
>
> Andrea put it this way:
> > As long as page faults are needed to execute the I/O I doubt it's safe. But
> > I'll definitely change khugepaged not to allocate memory. If nothing else
> > because I don't want khugepaged to make easier to trigger issues like this. But
> > it's hard for me to consider this a bug of khugepaged from a theoretical
> > standpoint.
>
> I tend to agree. khugepaged makes the likelihood of these things
> happening much higher, but I don't think it fundamentally creates the
> issue.

Yep.

So I understand the gpfs deadlock perfectly, and like you said it's
hard for me to consider it a definitive khugepaged bug from a
theoretical standpoint, but I definitely plan to send a 67/66 patch
that moves the khugepaged allocation outside of the mmap_sem write
mode. In fact if you were to run with CONFIG_NUMA=n likely it would
already work just fine (but if you use gpfs I bet you need NUMA=y :).

I need to check your other reproducer better.

I think this could be fixed in userland, this applies to openvpn too
if used as nfs backend.

And for the record, this filesystem in userland scanned by khugepaged
("fuse/gpfs") is the only open issue there is with the patchset (with
regard to the version for 2.6.32), no other issue at all as far as
2.6.32 is concerned so the same should apply to the 2.6.37-rc
version. cgroups still need more work in the 2.6.37-rc version
though. Fixing the above is actually a walk in the park compared to
the cgroup updates, and I'll do it very soon, just I didn't think it
was important enough to defer the submit as I can send a simple
67/66. Plus the problem is not reproducible unless you're in the stress
test scenario you described above that is unlikely to match any
production environment (and no corruption can ever materialize because
of this, it's just a subtle annoyance that we need to get rid of).

Thanks a lot for all the very useful testing. I hope you'll be able to
verify the fix as soon as I send it (likely next week, as I'm
travelling). If you want to fix it yourself, go ahead: you only need to
find the vma and do the allocation before dropping the mmap_sem read
mode and taking the mmap_sem write mode to call collapse_huge_page
(plus some other details).
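
The ordering described here can be written down as a toy model. This is only a sketch of the sequence, not the actual patch: `MmapSem`, `alloc_hugepage` and `collapse_with_prealloc` are made-up names, and the lock is reduced to a recorded mode.

```python
class MmapSem:
    """Toy stand-in for mmap_sem that only records the current lock mode."""

    def __init__(self):
        self.mode = None  # None, "read", or "write"


def alloc_hugepage(sem):
    # Hypothetical helper: in the kernel this allocation may enter
    # compaction and wait on I/O; here it only records the lock mode.
    return {"allocated_under": sem.mode}


def collapse_with_prealloc(sem):
    """Allocate under read mode, take write mode only for the collapse."""
    events = []
    sem.mode = "read"            # down_read(): find the vma
    events.append("find_vma")
    page = alloc_hugepage(sem)   # allocation happens before write mode
    events.append("alloc")
    sem.mode = None              # up_read()
    sem.mode = "write"           # down_write(); assumption: the vma is
                                 # revalidated here, it may have changed
    events.append("collapse_huge_page")
    sem.mode = None              # up_write()
    return page, events
```

The point of the ordering is that a page fault in the daemon, which takes mmap_sem for read, is never stuck behind a writer that is itself waiting on an allocation.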

Andrea

Miklos Szeredi

Nov 4, 2010, 3:54:03 PM
to Andrea Arcangeli, da...@linux.vnet.ibm.com, mik...@szeredi.hu, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, shen...@cn.ibm.com, volo...@us.ibm.com, m...@linux.vnet.ibm.com, di...@cn.ibm.com, lnxn...@us.ibm.com
On Thu, 4 Nov 2010, Andrea Arcangeli wrote:
> On Wed, Nov 03, 2010 at 01:43:25PM -0700, Dave Hansen wrote:
> > some IBM testers ran into some deadlocks. It appears that the
> > khugepaged process is trying to migrate one of a filesystem daemon's
> > pages while khugepaged holds the daemon's mmap_sem for write.
>
> Correct. So now I'm wondering what happens if some library of this
> daemon happens to execute a munmap that calls split_vma and allocates
> memory while holding the mmap_sem, and the memory allocation triggers
> I/O that will have to be executed by the daemon.

mmap_sem is not really relevant here(*), page lock is. And in vmscan.c,
there's not a single blocking lock_page().

Also, as I mentioned, fuse does writeback in a special way: it copies dirty
pages to non-page cache pages which don't interact in any way with
reclaim. Fuse writeback is instantaneous from the reclaim PoV.

> I think this could be fixed in userland, this applies to openvpn too
> if used as nfs backend.

How?

Thanks,
Miklos

(*) In the original gpfs trace it is relevant but only because the
page migration is triggered by khugepaged. In the reproduced example
the page migration is triggered directly by an allocation. Since page
migration does blocking lock_page(), there's really no way to avoid a
deadlock in that case.

Andrea Arcangeli

Dec 14, 2010, 12:47:03 PM
to Dave Hansen, Miklos Szeredi, linux-fsdevel, linux-mm, linux-...@vger.kernel.org, Lin Feng Shen, Yuri L Volobuev, Mel Gorman, di...@cn.ibm.com, lnxninja
Hello Dave and everyone,

On Wed, Nov 03, 2010 at 01:43:25PM -0700, Dave Hansen wrote:

> Hey Miklos,
>
> When testing with a transparent huge page kernel:
>
> http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/andrea/aa.git;a=summary
>
> some IBM testers ran into some deadlocks. It appears that the
> khugepaged process is trying to migrate one of a filesystem daemon's
> pages while khugepaged holds the daemon's mmap_sem for write.

The allocation under mmap_sem write mode in khugepaged bug should be
fixed in current aa.git based on 37-rc5:

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=shortlog
http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=83e4d55d0014b3eeb982005d73f55ffcf2813504

Let me know how it goes, it's not very well tested yet (which is why I
didn't make a new submit yet).

I stick to my idea that this is a bug in userland and may trigger if
your daemon does mmap/munmap and the vma allocation under mmap_sem
waits for the I/O, but I don't want to show it with THP enabled, and
this is more scalable, so it's definitely a good idea with no downside
whatsoever.

Thanks for the report,
Andrea

Miklos Szeredi

Dec 14, 2010, 4:04:01 PM
to Andrea Arcangeli, da...@linux.vnet.ibm.com, mik...@szeredi.hu, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, shen...@cn.ibm.com, volo...@us.ibm.com, m...@linux.vnet.ibm.com, di...@cn.ibm.com, lnxn...@us.ibm.com
On Tue, 14 Dec 2010, Andrea Arcangeli wrote:
> Hello Dave and everyone,
>
> On Wed, Nov 03, 2010 at 01:43:25PM -0700, Dave Hansen wrote:
> > Hey Miklos,
> >
> > When testing with a transparent huge page kernel:
> >
> > http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/andrea/aa.git;a=summary
> >
> > some IBM testers ran into some deadlocks. It appears that the
> > khugepaged process is trying to migrate one of a filesystem daemon's
> > pages while khugepaged holds the daemon's mmap_sem for write.
>
> The allocation under mmap_sem write mode in khugepaged bug should be
> fixed in current aa.git based on 37-rc5:
>
> http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=shortlog
> http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=83e4d55d0014b3eeb982005d73f55ffcf2813504
>
> Let me know how it goes, it's not very well tested yet (which is why I
> didn't make a new submit yet).
>
> I stick to my idea this is bug in userland and may trigger if your
> daemon does mmap/munmap and the vma allocation under mmap_sem waits
> for the I/O, but I don't want to show it with THP enabled, and this is
> more scalable so it's definitely good idea and no downside whatsoever.

This is all fine and dandy, but please let's not forget about the
other thing that Dave's test uncovered. Namely that page migration
triggered by transparent hugepages takes the page lock on arbitrary
filesystems. This is also deadlocky on fuse, but also not a good idea
for any filesystem where page reading time is not bounded (think NFS
with network down).

Thanks,
Miklos

Andrea Arcangeli

Dec 15, 2010, 12:25:20 AM
to Miklos Szeredi, da...@linux.vnet.ibm.com, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, shen...@cn.ibm.com, volo...@us.ibm.com, m...@linux.vnet.ibm.com, di...@cn.ibm.com, lnxn...@us.ibm.com
Hello Miklos and everyone,

On Tue, Dec 14, 2010 at 10:03:33PM +0100, Miklos Szeredi wrote:
> This is all fine and dandy, but please let's not forget about the
> other thing that Dave's test uncovered. Namely that page migration
> triggered by transparent hugepages takes the page lock on arbitrary
> filesystems. This is also deadlocky on fuse, but also not a good idea
> for any filesystem where page reading time is not bounded (think NFS
> with network down).

In #33 I fixed the mmap_sem write issue, which is clearer to me, and
it makes the code better.

As for the page lock, I don't have the full picture. Notably, there is
no waiting on the page lock in khugepaged, and khugepaged can't use
page migration (it's not migrating, it's collapsing).

I don't see how the page lock mentioned in the migration context can be
related to THP. There's not a _single_ lock_page in mm/huge_memory.c.

If fuse has deadlock troubles in the migration lock_page, then I would
guess THP has nothing to do with it and the cause is memory compaction,
which can already trigger in upstream stable 2.6.36 with
CONFIG_COMPACTION=y by just doing:

echo 1024 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

or by simply insmodding a driver that tries a large
alloc_pages(order).

My understanding of Dave's trace is that THP makes it easier to
reproduce, but this isn't really THP related, it can happen already
upstream without my patchset applied, and it's just a pure coincidence
that THP makes it easier to reproduce. How to fix I'm not sure yet
as I didn't look into it closely as I was focusing on rolling a THP
specific update first, but at the moment it even sounds more like an
issue with strict migration than memory compaction.

Thanks,
Andrea

Miklos Szeredi

Dec 15, 2010, 9:55:16 AM
to Andrea Arcangeli, mik...@szeredi.hu, da...@linux.vnet.ibm.com, linux-...@vger.kernel.org, linu...@kvack.org, linux-...@vger.kernel.org, shen...@cn.ibm.com, volo...@us.ibm.com, m...@linux.vnet.ibm.com, di...@cn.ibm.com, lnxn...@us.ibm.com

Right, it's questionable whether any page migration should wait for
I/O as it can introduce large delays, and even complete lockup of an
unrelated process (as in case of NFS server being offline).

The man page for move_pages() clearly defines I/O as an error
condition:

-EBUSY The page is currently busy and cannot be moved. Try again
later. This occurs if a page is undergoing I/O or another
kernel subsystem is holding a reference to the page.

yet the actual code waits for I/O, both read and write. That might be
OK with some timeouts. Page migration is best effort anyway, so
waiting forever on I/O makes little sense.
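
As a rough userspace illustration of that best-effort behaviour, a lock attempt with a bounded wait can report EBUSY instead of blocking. This is a sketch only: `try_migrate_page` and its timeout are made up, and a `threading.Lock` stands in for the page lock.

```python
import errno
import threading


def try_migrate_page(page_lock, timeout_s=0.05):
    """Return 0 on success, -EBUSY if the page stays locked too long."""
    # Bounded wait instead of waiting forever on a page under I/O.
    if not page_lock.acquire(timeout=timeout_s):
        return -errno.EBUSY
    try:
        # A real migration would copy and remap the page here.
        return 0
    finally:
        page_lock.release()
```

A caller that gets -EBUSY can move on to the next candidate page, which matches what the man page already promises.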

> How to fix I'm not sure yet
> as I didn't look into it closely as I was focusing on rolling a THP
> specific update first, but at the moment it even sounds more like an
> issue with strict migration than memory compaction.

Yes, this is a page migration issue. But the fact is, THP will make
it more visible exactly because it can be used without any special
configuration.

Thanks,
Miklos
