Something like that:
[<ffffffff8106620a>] ? prepare_to_wait+0x2a/0x90
[<ffffffff810790bd>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff8140db12>] ? _raw_spin_unlock_irqrestore+0x42/0x80
[<ffffffff8112a390>] ? bdi_sched_wait+0x0/0x20
[<ffffffff8112a39e>] bdi_sched_wait+0xe/0x20
[<ffffffff8140af6f>] __wait_on_bit+0x5f/0x90
[<ffffffff8112a390>] ? bdi_sched_wait+0x0/0x20
[<ffffffff8140b018>] out_of_line_wait_on_bit+0x78/0x90
[<ffffffff81065fd0>] ? wake_bit_function+0x0/0x40
[<ffffffff8112a2d3>] ? bdi_queue_work+0xa3/0xe0
[<ffffffff8112a37f>] bdi_sync_writeback+0x6f/0x80
[<ffffffff8112a3d2>] sync_inodes_sb+0x22/0x120
[<ffffffff8112f1d2>] __sync_filesystem+0x82/0x90
[<ffffffff8112f3db>] sync_filesystem+0x4b/0x70
[<ffffffff811391de>] fsync_bdev+0x2e/0x60
[<ffffffff812226be>] invalidate_partition+0x2e/0x50
[<ffffffff8116b92f>] del_gendisk+0x3f/0x140
[<ffffffffa00c0233>] mmc_blk_remove+0x33/0x60 [mmc_block]
[<ffffffff81338977>] mmc_bus_remove+0x17/0x20
[<ffffffff812ce746>] __device_release_driver+0x66/0xc0
[<ffffffff812ce89d>] device_release_driver+0x2d/0x40
[<ffffffff812cd9b5>] bus_remove_device+0xb5/0x120
[<ffffffff812cb46f>] device_del+0x12f/0x1a0
[<ffffffff81338a5b>] mmc_remove_card+0x5b/0x90
[<ffffffff8133ac27>] mmc_sd_remove+0x27/0x50
[<ffffffff81337d8c>] mmc_resume_host+0x10c/0x140
[<ffffffffa00850e9>] sdhci_resume_host+0x69/0xa0 [sdhci]
[<ffffffffa0bdc39e>] sdhci_pci_resume+0x8e/0xb0 [sdhci_pci]
bdi_queue_work seems to be the problem.
Some device drivers need to remove their cards logically in .suspend,
because the card is removable, and can be changed while system is
suspended.
Best regards,
Maxim Levitsky
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Any update?
Well, it shouldn't be called from there, then.
I don't know how to resolve this right now.
Rafael
> On Monday 15 February 2010, Maxim Levitsky wrote:
> > On Sat, 2010-02-13 at 15:29 +0200, Maxim Levitsky wrote:
> > > I noticed that currently calling del_gendisk leads to sure deadlock if
> > > attemped from .suspend or .resume functions.
>
> Well, it shouldn't be called from there, then.
Even if drivers avoid calling it from within suspend methods, they have
to be able to call it from within resume methods. After all, the
resume method may find that the disk's device has vanished.
This is a matter for Jens. Is the bdi writeback task freezable? If it
is, should it be made unfreezable?
Alan Stern
Any update?
Best regards,
Maxim Levitsky
--
del_gendisk() needs process context at least, since it'll sleep (not
just for sync/invalidate, but other parts of the destruction as well).
I'm not a big expect on what tasks should be freezable or not. As it
stands, the writeback tasks will attempt to freeze and thaw with the
system. I guess that screws the sync from resume call, since it's not
running and the sync will wait for it to retrieve and finish that work
item.
To the suspend experts - can we safely mark the writeback tasks as
non-freezable?
--
Jens Axboe
> On Tue, Feb 16 2010, Alan Stern wrote:
> > On Mon, 15 Feb 2010, Rafael J. Wysocki wrote:
> >
> > > On Monday 15 February 2010, Maxim Levitsky wrote:
> > > > On Sat, 2010-02-13 at 15:29 +0200, Maxim Levitsky wrote:
> > > > > I noticed that currently calling del_gendisk leads to sure deadlock if
> > > > > attemped from .suspend or .resume functions.
> > >
> > > Well, it shouldn't be called from there, then.
> >
> > Even if drivers avoid calling it from within suspend methods, they have
> > to be able to call it from within resume methods. After all, the
> > resume method may find that the disk's device has vanished.
>
> del_gendisk() needs process context at least, since it'll sleep (not
> just for sync/invalidate, but other parts of the destruction as well).
That's not a problem; suspend and resume run in process context.
> > This is a matter for Jens. Is the bdi writeback task freezable? If it
> > is, should it be made unfreezable?
>
> I'm not a big expect on what tasks should be freezable or not. As it
> stands, the writeback tasks will attempt to freeze and thaw with the
> system. I guess that screws the sync from resume call, since it's not
> running and the sync will wait for it to retrieve and finish that work
> item.
>
> To the suspend experts - can we safely mark the writeback tasks as
> non-freezable?
The reason for freezing those tasks is to avoid writebacks at random
times during a system sleep transition, when the underlying device may
already be suspended, right?
In principle, a device's writeback task could be unfrozen immediately
after the device is resumed. In practice this might not solve the
problem, since the del_gendisk() call occurs _within_ the device's
resume routine. I suppose del_gendisk() could be made responsible for
unfreezing the writeback task.
The best solution would be to have del_gendisk() avoid waiting for the
writeback task in cases where the underlying device has been removed.
I don't know if that is feasible, however.
Alan Stern
P.S.: Jens, given a pointer to a struct gendisk or to a struct
request_queue, is there a good way to tell whether there any dirty
buffers for that device waiting to be written out? This is for
purposes of runtime power management -- in the initial implementation,
I want to avoid powering-down a block device if it is open or has any
dirty buffers. In other words, only completely idle devices should be
powered down (a good example would be a card reader with no memory card
inserted).
Right, or at least it would seem pointless to have them running while
the device is suspended. But my point was that if it's easier (and
feasible) to just leave them running, perhaps that was easier.
> In principle, a device's writeback task could be unfrozen immediately
> after the device is resumed. In practice this might not solve the
> problem, since the del_gendisk() call occurs _within_ the device's
> resume routine. I suppose del_gendisk() could be made responsible for
> unfreezing the writeback task.
And that's back to the question of whether or not that is a nice thing to
do. It seems a bit dirty, but otoh where else to do it. Perhaps just
using the kblockd to postpone the del_gendisk() to out-of-resume context
would be the best approach.
> The best solution would be to have del_gendisk() avoid waiting for the
> writeback task in cases where the underlying device has been removed.
> I don't know if that is feasible, however.
kblockd?
> P.S.: Jens, given a pointer to a struct gendisk or to a struct
> request_queue, is there a good way to tell whether there any dirty
> buffers for that device waiting to be written out? This is for
> purposes of runtime power management -- in the initial implementation,
> I want to avoid powering-down a block device if it is open or has any
> dirty buffers. In other words, only completely idle devices should be
> powered down (a good example would be a card reader with no memory card
> inserted).
There's no fool proof way. For most file systems I think you could get
away with checking the q->bdi dirty lists to see if there's anything
pending. But that wont work always, if the fs uses a different backing
dev info than then queue itself.
--
Jens Axboe
> On Tue, Feb 23 2010, Alan Stern wrote:
> > > > This is a matter for Jens. Is the bdi writeback task freezable? If it
> > > > is, should it be made unfreezable?
> > >
> > > I'm not a big expect on what tasks should be freezable or not. As it
> > > stands, the writeback tasks will attempt to freeze and thaw with the
> > > system. I guess that screws the sync from resume call, since it's not
> > > running and the sync will wait for it to retrieve and finish that work
> > > item.
> > >
> > > To the suspend experts - can we safely mark the writeback tasks as
> > > non-freezable?
> >
> > The reason for freezing those tasks is to avoid writebacks at random
> > times during a system sleep transition, when the underlying device may
> > already be suspended, right?
>
> Right, or at least it would seem pointless to have them running while
> the device is suspended. But my point was that if it's easier (and
> feasible) to just leave them running, perhaps that was easier.
I don't have a clear picture of how the block layer operates. For
example, what is the reason for this comment in the definition of
struct genhd?
struct device *driverfs_dev; // FIXME: remove
Isn't that crucial for making a disk show up in sysfs? Is the comment
out of date?
A possible approach is to add suspend and resume methods for this
driverfs_dev, and make them be responsible for stopping and restarting
the writeback task instead of relying on the freezer. Then
del_gendisk() could cleanly restart the task when necessary.
> > In principle, a device's writeback task could be unfrozen immediately
> > after the device is resumed. In practice this might not solve the
> > problem, since the del_gendisk() call occurs _within_ the device's
> > resume routine. I suppose del_gendisk() could be made responsible for
> > unfreezing the writeback task.
>
> And that's back to the question of whether or not that is a nice thing to
> do. It seems a bit dirty, but otoh where else to do it. Perhaps just
> using the kblockd to postpone the del_gendisk() to out-of-resume context
> would be the best approach.
That would involve a layering violation, wouldn't it? Either the
driver would have to interface with kblockd directly, or else
del_gendisk() would need to know whether the writeback task was frozen.
On the whole, I think it's best for the block layer to retain full
control over its own tasks and requirements.
Alan Stern
> > P.S.: Jens, given a pointer to a struct gendisk or to a struct
> > request_queue, is there a good way to tell whether there any dirty
> > buffers for that device waiting to be written out? This is for
> > purposes of runtime power management -- in the initial implementation,
> > I want to avoid powering-down a block device if it is open or has any
> > dirty buffers. In other words, only completely idle devices should be
> > powered down (a good example would be a card reader with no memory card
> > inserted).
>
> There's no fool proof way. For most file systems I think you could get
> away with checking the q->bdi dirty lists to see if there's anything
> pending. But that wont work always, if the fs uses a different backing
> dev info than then queue itself.
That's not what I meant. Dirty buffers on a filesystem make no
difference because they always get written out when the filesystem is
unmounted. The device file remains open as long as the filesystem
is mounted, which would prevent the device from being powered down.
I was asking about dirty buffers on a block device that isn't holding a
filesystem -- where the raw device is being used directly for I/O.
Alan Stern
Don't ask me, I'd suggest using git blame for finding out who wrote that
and ping them.
> A possible approach is to add suspend and resume methods for this
> driverfs_dev, and make them be responsible for stopping and restarting
> the writeback task instead of relying on the freezer. Then
> del_gendisk() could cleanly restart the task when necessary.
That sounds over engineered to me.
> > > In principle, a device's writeback task could be unfrozen immediately
> > > after the device is resumed. In practice this might not solve the
> > > problem, since the del_gendisk() call occurs _within_ the device's
> > > resume routine. I suppose del_gendisk() could be made responsible for
> > > unfreezing the writeback task.
> >
> > And that's back to the question of whether or not that is a nice thing to
> > do. It seems a bit dirty, but otoh where else to do it. Perhaps just
> > using the kblockd to postpone the del_gendisk() to out-of-resume context
> > would be the best approach.
>
> That would involve a layering violation, wouldn't it? Either the
> driver would have to interface with kblockd directly, or else
> del_gendisk() would need to know whether the writeback task was frozen.
>
> On the whole, I think it's best for the block layer to retain full
> control over its own tasks and requirements.
You would export such functionality - del_gendisk_deferred(), or
something like that. The kblockd suggestion was implementation detail,
not something the driver would concern itself with. It's not exactly
picture perfect, but it could be used from eg resume context where the
device isn't fully live yet.
--
Jens Axboe
OK, so just specifically the page cache of the device. Is that really
enough of an issue to warrant special checking? I mean, what normal
setup would even use buffer raw device access?
But if you wanted, I guess the only way would be to lookup
dirty/writeback pages on the bdev inode mapping. For that you'd need the
bdev, not the gendisk or the queue though.
--
Jens Axboe
> > That's not what I meant. Dirty buffers on a filesystem make no
> > difference because they always get written out when the filesystem is
> > unmounted. The device file remains open as long as the filesystem
> > is mounted, which would prevent the device from being powered down.
> >
> > I was asking about dirty buffers on a block device that isn't holding a
> > filesystem -- where the raw device is being used directly for I/O.
>
> OK, so just specifically the page cache of the device. Is that really
> enough of an issue to warrant special checking? I mean, what normal
> setup would even use buffer raw device access?
Doesn't fdisk use it? There might be other applications too.
> But if you wanted, I guess the only way would be to lookup
> dirty/writeback pages on the bdev inode mapping. For that you'd need the
> bdev, not the gendisk or the queue though.
I can get the bdev from the gendisk by calling bdget_disk() with a
partition number of 0, right? What would the next step be? Would this
check for dirty pages associated with any of the partitions or would it
only look at pages associated with the inode for the entire disk?
Alan Stern
Hmm. There's still no way for the driver to know whether or not the
writeback task is frozen when it wants to call del_gendisk(). It
would have to defer _all_ such calls. And all hot-pluggable block
drivers would have to do this -- would that be acceptable?
How about plugging the request queue instead of freezing the writeback
task? Would that work? It should be easy enough for a driver to
unplug the queue before unregistering its device from within a resume
method.
Alan Stern
It does, but that sound be a very short lived issue (since the dirty
buffers will get flushed).
> > But if you wanted, I guess the only way would be to lookup
> > dirty/writeback pages on the bdev inode mapping. For that you'd need the
> > bdev, not the gendisk or the queue though.
>
> I can get the bdev from the gendisk by calling bdget_disk() with a
> partition number of 0, right? What would the next step be? Would this
> check for dirty pages associated with any of the partitions or would it
> only look at pages associated with the inode for the entire disk?
It would cover the entire bdev.
--
Jens Axboe
I was assuming it knew it was being called from a critical location,
like from resume. I guess the callback just iterates the bus devices and
calls the device remove, so that doesn't quite work without other
changes.
> How about plugging the request queue instead of freezing the writeback
> task? Would that work? It should be easy enough for a driver to
> unplug the queue before unregistering its device from within a resume
> method.
We have specific methods for either freezing of stopping or starting the
queue, perhaps those would be appropriate for suspend/resume actions. It
effectively prevents the queueing function from being called. If there
are dirty pages for the device, then it would not help though, as you
would still get stuck waiting for that IO to complete.
--
Jens Axboe
> > > But if you wanted, I guess the only way would be to lookup
> > > dirty/writeback pages on the bdev inode mapping. For that you'd need the
> > > bdev, not the gendisk or the queue though.
> >
> > I can get the bdev from the gendisk by calling bdget_disk() with a
> > partition number of 0, right? What would the next step be? Would this
> > check for dirty pages associated with any of the partitions or would it
> > only look at pages associated with the inode for the entire disk?
>
> It would cover the entire bdev.
Okay, so once I've got the bdev, how do I look up the dirty/writeback
pages on the inode mapping?
Alan Stern
> > How about plugging the request queue instead of freezing the writeback
> > task? Would that work? It should be easy enough for a driver to
> > unplug the queue before unregistering its device from within a resume
> > method.
>
> We have specific methods for either freezing of stopping or starting the
> queue, perhaps those would be appropriate for suspend/resume actions. It
> effectively prevents the queueing function from being called. If there
> are dirty pages for the device, then it would not help though, as you
> would still get stuck waiting for that IO to complete.
If the resume method would restart the queue before unregistering the
device, pending dirty pages wouldn't cause any problems. They'd get
sent down to the driver and rejected immediately because the device was
dead or done.
The difficulty with this approach is that it requires individual
attention for each block device driver. Either the driver has to
freeze/stop/plug the queue during suspend (and restart it during
resume) or else the device's writeback task has to be frozen.
Can this be encapsulated by a function in the block layer? For
example, drivers could call blk_set_hot_unpluggable(bdev) for devices
that might need to be unregistered during resume. Then they would
become responsible for managing the device's queue.
Alan Stern
I _think_ you can get away with not doing a radix lookup for dirty
pages, just looking at the BDI_RECLAIMABLE stat on the bdi. That would
be:
bdi_stat(bdev->bd_inode->i_mapping->backing_dev_info, BDI_RECLAIMABLE);
--
Jens Axboe
mapping_tagged(bdev->bd_inode->i_mapping, PAGECACHE_TAG_DIRTY);
is about as low overhead as it gets as the radix tree propagateѕ
tags back up to the root. i.e. no page lookups needed at all to
determine if it is dirty.
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
> > > This is a matter for Jens. Is the bdi writeback task freezable? If it
> > > is, should it be made unfreezable?
> >
> > I'm not a big expect on what tasks should be freezable or not. As it
> > stands, the writeback tasks will attempt to freeze and thaw with the
> > system. I guess that screws the sync from resume call, since it's not
> > running and the sync will wait for it to retrieve and finish that work
> > item.
> >
> > To the suspend experts - can we safely mark the writeback tasks as
> > non-freezable?
>
> The reason for freezing those tasks is to avoid writebacks at random
> times during a system sleep transition, when the underlying device may
> already be suspended, right?
It is also there to avoid inconsistency between in-filesystem data and
snapshot in hibernation image.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
> Hi!
>
> > > > This is a matter for Jens. Is the bdi writeback task freezable? If it
> > > > is, should it be made unfreezable?
> > >
> > > I'm not a big expect on what tasks should be freezable or not. As it
> > > stands, the writeback tasks will attempt to freeze and thaw with the
> > > system. I guess that screws the sync from resume call, since it's not
> > > running and the sync will wait for it to retrieve and finish that work
> > > item.
> > >
> > > To the suspend experts - can we safely mark the writeback tasks as
> > > non-freezable?
> >
> > The reason for freezing those tasks is to avoid writebacks at random
> > times during a system sleep transition, when the underlying device may
> > already be suspended, right?
>
> It is also there to avoid inconsistency between in-filesystem data and
> snapshot in hibernation image.
A good point, although in this case I think it won't matter. Writing
out a dirty page twice (once right after taking the snapshot and then
again after resuming from hibernation) will leave the disk in a correct
state.
Alan Stern
> > > The reason for freezing those tasks is to avoid writebacks at random
> > > times during a system sleep transition, when the underlying device may
> > > already be suspended, right?
> >
> > It is also there to avoid inconsistency between in-filesystem data and
> > snapshot in hibernation image.
>
> A good point, although in this case I think it won't matter. Writing
> out a dirty page twice (once right after taking the snapshot and then
> again after resuming from hibernation) will leave the disk in a correct
> state.
No, I don't think so. Have you considered all the various journalling
systems?
Definitely not in presence of I/O errors. Commit block can only be
written after previous blocks are successfully writen to the journal.
So lets see:
<snapshot>
Write previous block, write commit block, write more blocks
<hibernation powerdown, restart>
Error writing previous block (block now contains garbage), leading to
kernel panic
<restart>
journalling assumptions broken: commit block is there, but previous
blocks are not intact. Data loss.
...and that was the first I could think about. Lets not do
this. Barriers were invented for a reason.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Very well. Then we still need a solution to the original problem:
Devices sometimes need to be unregistered during resume, but
del_gendisk() blocks on the writeback thread, which is frozen until
after the resume finishes. How do you suggest this be fixed?
Alan Stern
I thought about thawing the writeback thread earlier in such cases.
Would that makes sense / is it doable at all?
Rafael
> > Very well. Then we still need a solution to the original problem:
> > Devices sometimes need to be unregistered during resume, but
> > del_gendisk() blocks on the writeback thread, which is frozen until
> > after the resume finishes. How do you suggest this be fixed?
>
> I thought about thawing the writeback thread earlier in such cases.
>
> Would that makes sense / is it doable at all?
My thought exactly. This is the only approach that also solves the
following race:
A driver is unloaded at the same time as a suspend starts.
The writeback thread gets frozen.
Then before the rmmod thread is frozen, it calls del_gendisk.
Delaying things by means of a workqueue (or the equivalent) might also
work, but it doesn't seem as safe. For example, some important
writebacks might end up getting delayed until too late.
Alan Stern
> > journalling assumptions broken: commit block is there, but previous
> > blocks are not intact. Data loss.
> >
> > ...and that was the first I could think about. Lets not do
> > this. Barriers were invented for a reason.
>
> Very well. Then we still need a solution to the original problem:
> Devices sometimes need to be unregistered during resume, but
> del_gendisk() blocks on the writeback thread, which is frozen until
> after the resume finishes. How do you suggest this be fixed?
Avoid unregistering device during resume. Instead, return errors until
resume is done and you can call del_gendisk?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
OK, so what exactly should trigger the thawing of the writeback thread?
Should we do that unconditionally at one point or wait for a specific situation
to happen, and how to recognize that situation in the latter case?
Rafael
> On Thursday 04 March 2010, Alan Stern wrote:
> > On Thu, 4 Mar 2010, Rafael J. Wysocki wrote:
> >
> > > > Very well. Then we still need a solution to the original problem:
> > > > Devices sometimes need to be unregistered during resume, but
> > > > del_gendisk() blocks on the writeback thread, which is frozen until
> > > > after the resume finishes. How do you suggest this be fixed?
> > >
> > > I thought about thawing the writeback thread earlier in such cases.
> > >
> > > Would that makes sense / is it doable at all?
> >
> > My thought exactly. This is the only approach that also solves the
> > following race:
> >
> > A driver is unloaded at the same time as a suspend starts.
> >
> > The writeback thread gets frozen.
> >
> > Then before the rmmod thread is frozen, it calls del_gendisk.
> >
> > Delaying things by means of a workqueue (or the equivalent) might also
> > work, but it doesn't seem as safe. For example, some important
> > writebacks might end up getting delayed until too late.
>
> OK, so what exactly should trigger the thawing of the writeback thread?
>
> Should we do that unconditionally at one point or wait for a specific situation
> to happen, and how to recognize that situation in the latter case?
My thought was that del_gendisk would do this unconditionally. It
would make the task unfreezable and then thaw it, to prevent a race.
This assumes that del_gendisk never gets called in the middle of a
sleep transition unless the device is really gone. If that's not the
case, Pavel's suggestion (make the resume routine do the device removal
asynchronously) might be better.
Alan Stern
Well, I guess we can make it a rule.
Or use a flag that will be set to 'true' during resume (early) and will tell
del_gendisk whether to thaw the writeback thread.
Rafael
> > My thought exactly. This is the only approach that also solves the
> > following race:
> >
> > A driver is unloaded at the same time as a suspend starts.
> >
> > The writeback thread gets frozen.
> >
> > Then before the rmmod thread is frozen, it calls del_gendisk.
> >
> > Delaying things by means of a workqueue (or the equivalent) might also
> > work, but it doesn't seem as safe. For example, some important
> > writebacks might end up getting delayed until too late.
Delaying writebacks during sleep should be ok... That's why we do
sync() after userspace is frozen -- nothing really important should be
waiting for writeback after that point.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Has this been fixed, or has a consensus about how to fix this been
achieved? I'm hitting the same problem and have some time to work on a
fix.
Matt
No, it hasn't.
> or has a consensus about how to fix this been
> achieved? I'm hitting the same problem and have some time to work on a
> fix.
Generally, it looks like del_gendisk should thaw writeback threads, but not
during suspend, only during resume.
Rafael
Thawing the writeback thread only during resume does fix the case
Maxim originally presented:
0. build kernel with CONFIG_MMC_UNSAFE_RESUME
1. insert SD card
2. suspend
3. remove SD card while suspended
4. resume from suspend hangs
But if CONFIG_MMC_UNSAFE_RESUME is not set, the kernel oopses during
suspend because the MMC device suspend times out:
mmc0: card e624 removed
**** DPM device timeout: pxa2xx-mci.0 (pxa2xx-mci)
kernel BUG at /home/mreimer/sdg/android/android-2.1/kernel/drivers/base/power/main.c:453!
Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = c0004000
[00000000] *pgd=00000000
Internal error: Oops: 817 [#1] PREEMPT
If I thaw the writeback thread unconditionally in del_gendisk() then
suspend and resume work as expected for both CONFIG_MMC_UNSAFE_RESUME
set/not set, even when the card is removed while suspended.
So what is the proper fix?
Matt
I don't see any reason not to let del_gendisk thaw the writeback thread
during suspend. Since the device is going away anyhow, letting the
thread run shouldn't cause any problems.
Alan Stern
So how does the attached patch look?
Matt
From 20d8340471eb05aa54af1349f4ddccecd9c230c6 Mon Sep 17 00:00:00 2001
From: Matt Reimer <mre...@sdgsystems.com>
Date: Thu, 13 May 2010 14:36:54 -0700
Subject: [PATCH] fs: prevent hang on suspend/resume when MMC/SD card present
Devices can come and go from the MMC/SD bus during suspend or resume,
when the writeback thread is frozen, resulting in a hang. So thaw the
writeback thread in del_gendisk() to prevent the hang.
Signed-off-by: Matt Reimer <mre...@sdgsystems.com>
---
fs/partitions/check.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/fs/partitions/check.c b/fs/partitions/check.c
index e238ab2..b303919 100644
--- a/fs/partitions/check.c
+++ b/fs/partitions/check.c
@@ -666,6 +666,8 @@ void del_gendisk(struct gendisk *disk)
struct disk_part_iter piter;
struct hd_struct *part;
+ thaw_process(disk->queue->backing_dev_info.wb.task);
+
/* invalidate stuff */
disk_part_iter_init(&piter, disk,
DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
--
1.7.0.4
> So how does the attached patch look?
>
> Matt
>
>
> From 20d8340471eb05aa54af1349f4ddccecd9c230c6 Mon Sep 17 00:00:00 2001
> From: Matt Reimer <mre...@sdgsystems.com>
> Date: Thu, 13 May 2010 14:36:54 -0700
> Subject: [PATCH] fs: prevent hang on suspend/resume when MMC/SD card present
>
> Devices can come and go from the MMC/SD bus during suspend or resume,
> when the writeback thread is frozen, resulting in a hang. So thaw the
> writeback thread in del_gendisk() to prevent the hang.
I don't see anything wrong with the patch itself, but I dislike the
description. Devices can come and go from any hotpluggable bus, not
just MMC/SD. That just happens to be the first place the problem was
observed.
Good point. How about this?
Matt
From 813bd223e5a2fa577b9e64ddf12654a93d0aab8b Mon Sep 17 00:00:00 2001
From: Matt Reimer <mre...@sdgsystems.com>
Date: Thu, 13 May 2010 14:36:54 -0700
Subject: [PATCH] fs: prevent hang on suspend/resume when MMC/SD card present
Devices can come and go bus during suspend or resume, when the
writeback thread is frozen, resulting in a hang. Prevent the hang
by thawing the writeback thread in del_gendisk().
Why not just make it unfreezeable to start with?
Regards,
Nigel
> > I don't see anything wrong with the patch itself, but I dislike the
> > description. �Devices can come and go from any hotpluggable bus, not
> > just MMC/SD. �That just happens to be the first place the problem was
> > observed.
>
> Good point. How about this?
>
> Matt
>
> From 813bd223e5a2fa577b9e64ddf12654a93d0aab8b Mon Sep 17 00:00:00 2001
> From: Matt Reimer <mre...@sdgsystems.com>
> Date: Thu, 13 May 2010 14:36:54 -0700
> Subject: [PATCH] fs: prevent hang on suspend/resume when MMC/SD card present
>
> Devices can come and go bus during suspend or resume, when the
> writeback thread is frozen, resulting in a hang. Prevent the hang
> by thawing the writeback thread in del_gendisk().
I would have said "the block layer's writeback thread", but this is
okay.
Alan Stern
On 15/05/10 12:37, Alan Stern wrote:
> On Fri, 14 May 2010, Nigel Cunningham wrote:
>
>> Hi.
>
>>> Subject: [PATCH] fs: prevent hang on suspend/resume when MMC/SD card present
>>>
>>> Devices can come and go bus during suspend or resume, when the
>>> writeback thread is frozen, resulting in a hang. Prevent the hang
>>> by thawing the writeback thread in del_gendisk().
>
>> Why not just make it unfreezeable to start with?
>
> If the writeback thread were unfreezable, it might wake up and try to
> write dirty pages back to disks after they were already suspended.
> That would not lead to good consequences...
If it syncs data as it should when we freeze processes, there won't be
any problem. Perhaps this is just an argument against making syncing
optional?
Regards,
Nigel
OK, so now I have a question who's going to take the patch.
Jens, what's your opinion?
Rafael
On 16/05/10 06:30, Rafael J. Wysocki wrote:
> On Saturday 15 May 2010, Alan Stern wrote:
>> On Thu, 13 May 2010, Matt Reimer wrote:
>>
>>>> I don't see anything wrong with the patch itself, but I dislike the
>>>> description. Devices can come and go from any hotpluggable bus, not
>>>> just MMC/SD. That just happens to be the first place the problem was
>>>> observed.
>>>
>>> Good point. How about this?
>>>
>>> Matt
>>>
>>> From 813bd223e5a2fa577b9e64ddf12654a93d0aab8b Mon Sep 17 00:00:00 2001
>>> From: Matt Reimer<mre...@sdgsystems.com>
>>> Date: Thu, 13 May 2010 14:36:54 -0700
>>> Subject: [PATCH] fs: prevent hang on suspend/resume when MMC/SD card present
>>>
>>> Devices can come and go bus during suspend or resume, when the
>>> writeback thread is frozen, resulting in a hang. Prevent the hang
>>> by thawing the writeback thread in del_gendisk().
>>
>> I would have said "the block layer's writeback thread", but this is
>> okay.
>
> OK, so now I have a question who's going to take the patch.
I object to the patch.
Tell the patch it ought to exit once thawed, by all means.
Make the patch unfreezeable to begin with, by all means.
But don't go down the path of having $random_code_path unfreeze a
thread. That will lead to unpredictability, confusion and bugs.
If you know a disk is going to be unregistered during resume, use the
hooks early in the suspend / hibernate process to block new I/O and
flush what's already there so that there's no need to block on the
writeback thread, and/or no need to have the writeback thread frozen.
Regards,
Nigel
I'm not sure what you mean. Care to explain?
> Make the patch unfreezeable to begin with, by all means.
That wouldn't work.
> But don't go down the path of having $random_code_path unfreeze a
> thread. That will lead to unpredictability, confusion and bugs.
As a general rule, I agree, but this particular case is somewhat special.
> If you know a disk is going to be unregistered during resume,
How do we check that, exactly?
> use the hooks early in the suspend / hibernate process to block new I/O and
> flush what's already there so that there's no need to block on the
> writeback thread, and/or no need to have the writeback thread frozen.
I'm not sure if that's realistic. Do you have a specific implementation in
mind?
Rafael
No, there is a problem. The writeback threads were made freezable after some
people had reported hangs during suspend that had been tracked down to that
issue. IIRC.
Rafael
I mean "Set up some sort of flag that it can look at once thawed at
resume time, and use that to tell it to exit at that point."
>> Make the patch unfreezeable to begin with, by all means.
>
> That wouldn't work.
Why not?
>> But don't go down the path of having $random_code_path unfreeze a
>> thread. That will lead to unpredictability, confusion and bugs.
>
> As a general rule, I agree, but this particular case is somewhat special.
>
>> If you know a disk is going to be unregistered during resume,
>
> How do we check that, exactly?
Well, if you can figure out that you need to go down this path at this
point in the process, you must be able to apply the same logic to come
to the same conclusion earlier in the process.
>> use the hooks early in the suspend / hibernate process to block new I/O and
>> flush what's already there so that there's no need to block on the
>> writeback thread, and/or no need to have the writeback thread frozen.
>
> I'm not sure if that's realistic. Do you have a specific implementation in
> mind?
No - just the conviction that there must be a way in which the logic
that says we need to call del_gendisk can be applied earlier to clean
things up at an earlier stage.
I'd love to look at it further, but I'm moving house on Wednesday, so
have a little bit of a pressing need to do some packing at the moment :)
That said, next week, life is going to be very different - I could look
at it then.
Regards,
Nigel
> >> I object to the patch.
> >>
> >> Tell the patch it ought to exit once thawed, by all means.
> >
> > I'm not sure what you mean. Care to explain?
>
> I mean "Set up some sort of flag that it can look at once thawed at
> resume time, and use that to tell it to exit at that point."
Doesn't the patch do exactly that? The "flag" is set by virtue of the
fact that this is part of del_gendisk -- which means the disk is being
unregistered and hence the writeback thread will exit shortly.
> >> Make the patch unfreezeable to begin with, by all means.
> >
> > That wouldn't work.
>
> Why not?
It would be nice to know exactly why. Perhaps the underlying problem
can be fixed.
> >> If you know a disk is going to be unregistered during resume,
> >
> > How do we check that, exactly?
>
> Well, if you can figure out that you need to go down this path at this
> point in the process, you must be able to apply the same logic to come
> to the same conclusion earlier in the process.
That's not true. You don't know that a device is going to be unplugged
until it actually _is_ unplugged.
Alan Stern
On 17/05/10 12:22, Alan Stern wrote:
> On Mon, 17 May 2010, Nigel Cunningham wrote:
>
>>>> I object to the patch.
>>>>
>>>> Tell the patch it ought to exit once thawed, by all means.
>>>
>>> I'm not sure what you mean. Care to explain?
>>
>> I mean "Set up some sort of flag that it can look at once thawed at
>> resume time, and use that to tell it to exit at that point."
>
> Doesn't the patch do exactly that? The "flag" is set by virtue of the
> fact that this is part of del_gendisk -- which means the disk is being
> unregistered and hence the writeback thread will exit shortly.
>
>>>> Make the patch unfreezeable to begin with, by all means.
>>>
>>> That wouldn't work.
>>
>> Why not?
>
> It would be nice to know exactly why. Perhaps the underlying problem
> can be fixed.
>
>>>> If you know a disk is going to be unregistered during resume,
>>>
>>> How do we check that, exactly?
>>
>> Well, if you can figure out that you need to go down this path at this
>> point in the process, you must be able to apply the same logic to come
>> to the same conclusion earlier in the process.
>
> That's not true. You don't know that a device is going to be unplugged
> until it actually _is_ unplugged.
Sorry - I got unregistered during suspend (instead of resume) in my
head. That said, I'd argue that we should be...
1) Syncing all the data at the start of the suspend/hibernate, so
there's nothing for the workthread to do if we do del_gendisk.
2) Telling things to exit if we do find the device is gone away at
resume time, but not relying on the going-away happening until post
process thaw, for a couple of reasons:
- Potential for races/confusion/mess etc in having $random process
thawing other processes. Only the thread doing the suspend/hibernate
should be freezing/thawing.
- We're dealing with the symptom, not the cause. Almost always a bad idea.
Regards,
Nigel
I don't see a problem here, as far as kernel threads are concerned. In this
particular case this is a subsystem thawing a thread that belongs to it. No
problem.
> - We're dealing with the symptom, not the cause. Almost always a bad idea.
I very much prefer to have a fix for a symptom than no fix at all, which is the
realistic alternative in this case.
So, I think we should merge the patch and if someone finds the root cause
at one point in future, then we can just use the *right* approach instead of
the present one.
The problem is real and people in the field are affected by it, so if you don't
have a working alternative patch, please just let go.
Thanks,
Rafael
I'm not denying that the problem is real. What I am concerned about is
finding a real solution, not just putting a sticky plaster over the
wound. It seems to me to be much wiser to deal with the issue properly
now instead of doing extra work later to diagnose what might be a harder
to reproduce symptom of the same problem. I'd happily put the time in
now myself, but I simply don't have the time this week.
Would it be possible to apply the patch, adding some sort of new tag
that can be used to say "This needs further attention", perhaps
including an enduring reference to this conversation. Later, the 'real'
fix could include another special tag that says "Proper fix for the
symptom addressed in commit 5e94f810"?
Regards,
Nigel
Yeah, /* FIXME: */ is for that. With some comment why we're doing this. :-)
> Later, the 'real' fix could include another special tag that says "Proper fix for the
> symptom addressed in commit 5e94f810"?
Thanks,
Rafael
> > > So, I think we should merge the patch and if someone finds the root cause
> > > at one point in future, then we can just use the *right* approach instead of
> > > the present one.
> > >
> > > The problem is real and people in the field are affected by it, so if you don't
> > > have a working alternative patch, please just let go.
> >
> > I'm not denying that the problem is real. What I am concerned about is
> > finding a real solution, not just putting a sticky plaster over the
> > wound. It seems to me to be much wiser to deal with the issue properly
> > now instead of doing extra work later to diagnose what might be a harder
> > to reproduce symptom of the same problem. I'd happily put the time in
> > now myself, but I simply don't have the time this week.
> >
> > Would it be possible to apply the patch, adding some sort of new tag
> > that can be used to say "This needs further attention", perhaps
> > including an enduring reference to this conversation.
Isn't there a problem involving filesystems with userspace components
(fuse and such)?
You can't sync these filesystems after userspace is frozen. And if you
sync everything first, there's a possibility that some more pages will
get marked dirty before all the threads are frozen.
The only safe course seems to be to freeze the writeback thread. But
Nigel is right; there is a race here. What happens if del_gendisk is
called just as the freezer is starting up? Clearly, del_gendisk needs
to mark the writeback thread as unfreezable in addition to thawing it.
And there has to be some synchronization -- a spinlock for example.
Alan Stern
WARN_ON() whenever patch triggers?
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On 25/05/10 05:02, Pavel Machek wrote:
>
>>>> - We're dealing with the symptom, not the cause. Almost always a bad idea.
>>>
>>> I very much prefer to have a fix for a symptom than no fix at all, which is the
>>> realistic alternative in this case.
>>>
>>> So, I think we should merge the patch and if someone finds the root cause
>>> at one point in future, then we can just use the *right* approach instead of
>>> the present one.
>>>
>>> The problem is real and people in the field are affected by it, so if you don't
>>> have a working alternative patch, please just let go.
>>
>> I'm not denying that the problem is real. What I am concerned about
>> is finding a real solution, not just putting a sticky plaster over
>> the wound. It seems to me to be much wiser to deal with the issue
>> properly now instead of doing extra work later to diagnose what
>> might be a harder to reproduce symptom of the same problem. I'd
>> happily put the time in now myself, but I simply don't have the time
>> this week.
>>
>> Would it be possible to apply the patch, adding some sort of new tag
>> that can be used to say "This needs further attention", perhaps
>> including an enduring reference to this conversation. Later, the
>> 'real' fix could include another special tag that says "Proper fix
>> for the symptom addressed in commit 5e94f810"?
>
> WARN_ON() whenever patch triggers?
I suppose that would do. I was thinking of a more generic git tag that
could perhaps be searched for later, but .. okay.
Nigel
This won't help ether. The same driver needs to unregister perfectly
working device on suspend, because the user might replace the card
during suspend and fool the os.
There is a setting, CONFIG_MMC_UNSAFE_RESUME and I use it, but it isn't
default.
Anyway to revive that old thread, how about introducing new
del_gendisk_no_sync?
A less safe version of del_gendisk, but which won't sync the filesystem.
Since driver knows that card is gone, there is no point of syncing it.
(the sync is done by invalidate_partition, so some flag should be
propagated to it).
Best regards,
Maxim Levitsky
> On Thu, 2010-03-04 at 14:53 +0100, Pavel Machek wrote:
> > Hi!
> >
> > > > journalling assumptions broken: commit block is there, but previous
> > > > blocks are not intact. Data loss.
> > > >
> > > > ...and that was the first I could think about. Lets not do
> > > > this. Barriers were invented for a reason.
> > >
> > > Very well. Then we still need a solution to the original problem:
> > > Devices sometimes need to be unregistered during resume, but
> > > del_gendisk() blocks on the writeback thread, which is frozen until
> > > after the resume finishes. How do you suggest this be fixed?
> >
> > Avoid unregistering device during resume. Instead, return errors until
> > resume is done and you can call del_gendisk?
>
> This won't help ether. The same driver needs to unregister perfectly
> working device on suspend, because the user might replace the card
> during suspend and fool the os.
> There is a setting, CONFIG_MMC_UNSAFE_RESUME and I use it, but it isn't
> default.
People have generally agreed that the best answer is to have
del_gendisk always thaw the writeback thread.
> Anyway to revive that old thread, how about introducing new
> del_gendisk_no_sync?
>
> A less safe version of del_gendisk, but which won't sync the filesystem.
> Since driver knows that card is gone, there is no point of syncing it.
>
> (the sync is done by invalidate_partition, so some flag should be
> propagated to it).
That might work for mmc, but it wouldn't help other drivers subject to
the same problem.
Besides, it's subject to races. What if the card _isn't_ gone, but for
some other reason the driver wants to unregister the device at a time
when the writeback thread is frozen?
Alan Stern
Best regards,
Maxim Levitsky
> On Fri, 2010-06-04 at 10:59 -0400, Alan Stern wrote:
> > On Fri, 4 Jun 2010, Maxim Levitsky wrote:
> >
> > > On Thu, 2010-03-04 at 14:53 +0100, Pavel Machek wrote:
> > > > Hi!
> > > >
> > > > > > journalling assumptions broken: commit block is there, but previous
> > > > > > blocks are not intact. Data loss.
> > > > > >
> > > > > > ...and that was the first I could think about. Lets not do
> > > > > > this. Barriers were invented for a reason.
> > > > >
> > > > > Very well. Then we still need a solution to the original problem:
> > > > > Devices sometimes need to be unregistered during resume, but
> > > > > del_gendisk() blocks on the writeback thread, which is frozen until
> > > > > after the resume finishes. How do you suggest this be fixed?
> > > >
> > > > Avoid unregistering device during resume. Instead, return errors until
> > > > resume is done and you can call del_gendisk?
> > >
> > > This won't help ether. The same driver needs to unregister perfectly
> > > working device on suspend, because the user might replace the card
> > > during suspend and fool the os.
> > > There is a setting, CONFIG_MMC_UNSAFE_RESUME and I use it, but it isn't
> > > default.
> >
> > People have generally agreed that the best answer is to have
> > del_gendisk always thaw the writeback thread.
> Now the question is how to do that? :-)
Here's a start:
http://marc.info/?l=linux-kernel&m=127378922620074&w=2
It's not quite right, because it needs to make the writeback thread
unfreezable before thawing it.