Is it supposed to be ok to call del_gendisk while userspace is frozen?

Maxim Levitsky

unread,

Feb 13, 2010, 8:40:02 AM2/13/10

to

I noticed that currently calling del_gendisk leads to sure deadlock if
attemped from .suspend or .resume functions.

Something like that:

[<ffffffff8106620a>] ? prepare_to_wait+0x2a/0x90
[<ffffffff810790bd>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff8140db12>] ? _raw_spin_unlock_irqrestore+0x42/0x80
[<ffffffff8112a390>] ? bdi_sched_wait+0x0/0x20
[<ffffffff8112a39e>] bdi_sched_wait+0xe/0x20
[<ffffffff8140af6f>] __wait_on_bit+0x5f/0x90
[<ffffffff8112a390>] ? bdi_sched_wait+0x0/0x20
[<ffffffff8140b018>] out_of_line_wait_on_bit+0x78/0x90
[<ffffffff81065fd0>] ? wake_bit_function+0x0/0x40
[<ffffffff8112a2d3>] ? bdi_queue_work+0xa3/0xe0
[<ffffffff8112a37f>] bdi_sync_writeback+0x6f/0x80
[<ffffffff8112a3d2>] sync_inodes_sb+0x22/0x120
[<ffffffff8112f1d2>] __sync_filesystem+0x82/0x90
[<ffffffff8112f3db>] sync_filesystem+0x4b/0x70
[<ffffffff811391de>] fsync_bdev+0x2e/0x60
[<ffffffff812226be>] invalidate_partition+0x2e/0x50
[<ffffffff8116b92f>] del_gendisk+0x3f/0x140
[<ffffffffa00c0233>] mmc_blk_remove+0x33/0x60 [mmc_block]
[<ffffffff81338977>] mmc_bus_remove+0x17/0x20
[<ffffffff812ce746>] __device_release_driver+0x66/0xc0
[<ffffffff812ce89d>] device_release_driver+0x2d/0x40
[<ffffffff812cd9b5>] bus_remove_device+0xb5/0x120
[<ffffffff812cb46f>] device_del+0x12f/0x1a0
[<ffffffff81338a5b>] mmc_remove_card+0x5b/0x90
[<ffffffff8133ac27>] mmc_sd_remove+0x27/0x50
[<ffffffff81337d8c>] mmc_resume_host+0x10c/0x140
[<ffffffffa00850e9>] sdhci_resume_host+0x69/0xa0 [sdhci]
[<ffffffffa0bdc39e>] sdhci_pci_resume+0x8e/0xb0 [sdhci_pci]

bdi_queue_work seems to be the problem.

Some device drivers need to remove their cards logically in .suspend,
because the card is removable, and can be changed while system is
suspended.

Best regards,
Maxim Levitsky

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Maxim Levitsky

unread,

Feb 15, 2010, 11:10:01 AM2/15/10

to

Any update?

Rafael J. Wysocki

unread,

Feb 15, 2010, 4:10:02 PM2/15/10

to

On Monday 15 February 2010, Maxim Levitsky wrote:
> On Sat, 2010-02-13 at 15:29 +0200, Maxim Levitsky wrote:
> > I noticed that currently calling del_gendisk leads to sure deadlock if
> > attemped from .suspend or .resume functions.

Well, it shouldn't be called from there, then.

I don't know how to resolve this right now.

Rafael

Alan Stern

unread,

Feb 16, 2010, 11:30:03 AM2/16/10

to

On Mon, 15 Feb 2010, Rafael J. Wysocki wrote:

> On Monday 15 February 2010, Maxim Levitsky wrote:
> > On Sat, 2010-02-13 at 15:29 +0200, Maxim Levitsky wrote:
> > > I noticed that currently calling del_gendisk leads to sure deadlock if
> > > attemped from .suspend or .resume functions.
>
> Well, it shouldn't be called from there, then.

Even if drivers avoid calling it from within suspend methods, they have
to be able to call it from within resume methods. After all, the
resume method may find that the disk's device has vanished.

This is a matter for Jens. Is the bdi writeback task freezable? If it
is, should it be made unfreezable?

Alan Stern

Maxim Levitsky

unread,

Feb 20, 2010, 5:30:02 PM2/20/10

to

Any update?

Best regards,
Maxim Levitsky

--

Jens Axboe

unread,

Feb 23, 2010, 7:40:02 AM2/23/10

to

On Tue, Feb 16 2010, Alan Stern wrote:
> On Mon, 15 Feb 2010, Rafael J. Wysocki wrote:
>
> > On Monday 15 February 2010, Maxim Levitsky wrote:
> > > On Sat, 2010-02-13 at 15:29 +0200, Maxim Levitsky wrote:
> > > > I noticed that currently calling del_gendisk leads to sure deadlock if
> > > > attemped from .suspend or .resume functions.
> >
> > Well, it shouldn't be called from there, then.
>
> Even if drivers avoid calling it from within suspend methods, they have
> to be able to call it from within resume methods. After all, the
> resume method may find that the disk's device has vanished.

del_gendisk() needs process context at least, since it'll sleep (not
just for sync/invalidate, but other parts of the destruction as well).

I'm not a big expect on what tasks should be freezable or not. As it
stands, the writeback tasks will attempt to freeze and thaw with the
system. I guess that screws the sync from resume call, since it's not
running and the sync will wait for it to retrieve and finish that work
item.

To the suspend experts - can we safely mark the writeback tasks as
non-freezable?

--
Jens Axboe

Alan Stern

unread,

Feb 23, 2010, 10:30:02 AM2/23/10

to

On Tue, 23 Feb 2010, Jens Axboe wrote:

> On Tue, Feb 16 2010, Alan Stern wrote:
> > On Mon, 15 Feb 2010, Rafael J. Wysocki wrote:
> >
> > > On Monday 15 February 2010, Maxim Levitsky wrote:
> > > > On Sat, 2010-02-13 at 15:29 +0200, Maxim Levitsky wrote:
> > > > > I noticed that currently calling del_gendisk leads to sure deadlock if
> > > > > attemped from .suspend or .resume functions.
> > >
> > > Well, it shouldn't be called from there, then.
> >
> > Even if drivers avoid calling it from within suspend methods, they have
> > to be able to call it from within resume methods. After all, the
> > resume method may find that the disk's device has vanished.
>
> del_gendisk() needs process context at least, since it'll sleep (not
> just for sync/invalidate, but other parts of the destruction as well).

That's not a problem; suspend and resume run in process context.

> > This is a matter for Jens. Is the bdi writeback task freezable? If it
> > is, should it be made unfreezable?
>
> I'm not a big expect on what tasks should be freezable or not. As it
> stands, the writeback tasks will attempt to freeze and thaw with the
> system. I guess that screws the sync from resume call, since it's not
> running and the sync will wait for it to retrieve and finish that work
> item.
>
> To the suspend experts - can we safely mark the writeback tasks as
> non-freezable?

The reason for freezing those tasks is to avoid writebacks at random
times during a system sleep transition, when the underlying device may
already be suspended, right?

In principle, a device's writeback task could be unfrozen immediately
after the device is resumed. In practice this might not solve the
problem, since the del_gendisk() call occurs _within_ the device's
resume routine. I suppose del_gendisk() could be made responsible for
unfreezing the writeback task.

The best solution would be to have del_gendisk() avoid waiting for the
writeback task in cases where the underlying device has been removed.
I don't know if that is feasible, however.

Alan Stern

P.S.: Jens, given a pointer to a struct gendisk or to a struct
request_queue, is there a good way to tell whether there any dirty
buffers for that device waiting to be written out? This is for
purposes of runtime power management -- in the initial implementation,
I want to avoid powering-down a block device if it is open or has any
dirty buffers. In other words, only completely idle devices should be
powered down (a good example would be a card reader with no memory card
inserted).

Jens Axboe

unread,

Feb 23, 2010, 11:00:02 AM2/23/10

to

On Tue, Feb 23 2010, Alan Stern wrote:
> > > This is a matter for Jens. Is the bdi writeback task freezable? If it
> > > is, should it be made unfreezable?
> >
> > I'm not a big expect on what tasks should be freezable or not. As it
> > stands, the writeback tasks will attempt to freeze and thaw with the
> > system. I guess that screws the sync from resume call, since it's not
> > running and the sync will wait for it to retrieve and finish that work
> > item.
> >
> > To the suspend experts - can we safely mark the writeback tasks as
> > non-freezable?
>
> The reason for freezing those tasks is to avoid writebacks at random
> times during a system sleep transition, when the underlying device may
> already be suspended, right?

Right, or at least it would seem pointless to have them running while
the device is suspended. But my point was that if it's easier (and
feasible) to just leave them running, perhaps that was easier.

> In principle, a device's writeback task could be unfrozen immediately
> after the device is resumed. In practice this might not solve the
> problem, since the del_gendisk() call occurs _within_ the device's
> resume routine. I suppose del_gendisk() could be made responsible for
> unfreezing the writeback task.

And that's back to the question of whether or not that is a nice thing to
do. It seems a bit dirty, but otoh where else to do it. Perhaps just
using the kblockd to postpone the del_gendisk() to out-of-resume context
would be the best approach.

> The best solution would be to have del_gendisk() avoid waiting for the
> writeback task in cases where the underlying device has been removed.
> I don't know if that is feasible, however.

kblockd?

> P.S.: Jens, given a pointer to a struct gendisk or to a struct
> request_queue, is there a good way to tell whether there any dirty
> buffers for that device waiting to be written out? This is for
> purposes of runtime power management -- in the initial implementation,
> I want to avoid powering-down a block device if it is open or has any
> dirty buffers. In other words, only completely idle devices should be
> powered down (a good example would be a card reader with no memory card
> inserted).

There's no fool proof way. For most file systems I think you could get
away with checking the q->bdi dirty lists to see if there's anything
pending. But that wont work always, if the fs uses a different backing
dev info than then queue itself.

--
Jens Axboe

Alan Stern

unread,

Feb 23, 2010, 11:40:02 AM2/23/10

to

On Tue, 23 Feb 2010, Jens Axboe wrote:

> On Tue, Feb 23 2010, Alan Stern wrote:
> > > > This is a matter for Jens. Is the bdi writeback task freezable? If it
> > > > is, should it be made unfreezable?
> > >
> > > I'm not a big expect on what tasks should be freezable or not. As it
> > > stands, the writeback tasks will attempt to freeze and thaw with the
> > > system. I guess that screws the sync from resume call, since it's not
> > > running and the sync will wait for it to retrieve and finish that work
> > > item.
> > >
> > > To the suspend experts - can we safely mark the writeback tasks as
> > > non-freezable?
> >
> > The reason for freezing those tasks is to avoid writebacks at random
> > times during a system sleep transition, when the underlying device may
> > already be suspended, right?
>
> Right, or at least it would seem pointless to have them running while
> the device is suspended. But my point was that if it's easier (and
> feasible) to just leave them running, perhaps that was easier.

I don't have a clear picture of how the block layer operates. For
example, what is the reason for this comment in the definition of
struct genhd?

struct device *driverfs_dev; // FIXME: remove

Isn't that crucial for making a disk show up in sysfs? Is the comment
out of date?

A possible approach is to add suspend and resume methods for this
driverfs_dev, and make them be responsible for stopping and restarting
the writeback task instead of relying on the freezer. Then
del_gendisk() could cleanly restart the task when necessary.

> > In principle, a device's writeback task could be unfrozen immediately
> > after the device is resumed. In practice this might not solve the
> > problem, since the del_gendisk() call occurs _within_ the device's
> > resume routine. I suppose del_gendisk() could be made responsible for
> > unfreezing the writeback task.
>
> And that's back to the question of whether or not that is a nice thing to
> do. It seems a bit dirty, but otoh where else to do it. Perhaps just
> using the kblockd to postpone the del_gendisk() to out-of-resume context
> would be the best approach.

That would involve a layering violation, wouldn't it? Either the
driver would have to interface with kblockd directly, or else
del_gendisk() would need to know whether the writeback task was frozen.

On the whole, I think it's best for the block layer to retain full
control over its own tasks and requirements.

Alan Stern

unread,

Feb 23, 2010, 11:50:01 AM2/23/10

to

On Tue, 23 Feb 2010, Jens Axboe wrote:

> > P.S.: Jens, given a pointer to a struct gendisk or to a struct
> > request_queue, is there a good way to tell whether there any dirty
> > buffers for that device waiting to be written out? This is for
> > purposes of runtime power management -- in the initial implementation,
> > I want to avoid powering-down a block device if it is open or has any
> > dirty buffers. In other words, only completely idle devices should be
> > powered down (a good example would be a card reader with no memory card
> > inserted).
>
> There's no fool proof way. For most file systems I think you could get
> away with checking the q->bdi dirty lists to see if there's anything
> pending. But that wont work always, if the fs uses a different backing
> dev info than then queue itself.

That's not what I meant. Dirty buffers on a filesystem make no
difference because they always get written out when the filesystem is
unmounted. The device file remains open as long as the filesystem
is mounted, which would prevent the device from being powered down.

I was asking about dirty buffers on a block device that isn't holding a
filesystem -- where the raw device is being used directly for I/O.

Alan Stern

Jens Axboe

unread,

Feb 23, 2010, 5:20:01 PM2/23/10

to

On Tue, Feb 23 2010, Alan Stern wrote:
> On Tue, 23 Feb 2010, Jens Axboe wrote:
>
> > On Tue, Feb 23 2010, Alan Stern wrote:
> > > > > This is a matter for Jens. Is the bdi writeback task freezable? If it
> > > > > is, should it be made unfreezable?
> > > >
> > > > I'm not a big expect on what tasks should be freezable or not. As it
> > > > stands, the writeback tasks will attempt to freeze and thaw with the
> > > > system. I guess that screws the sync from resume call, since it's not
> > > > running and the sync will wait for it to retrieve and finish that work
> > > > item.
> > > >
> > > > To the suspend experts - can we safely mark the writeback tasks as
> > > > non-freezable?
> > >
> > > The reason for freezing those tasks is to avoid writebacks at random
> > > times during a system sleep transition, when the underlying device may
> > > already be suspended, right?
> >
> > Right, or at least it would seem pointless to have them running while
> > the device is suspended. But my point was that if it's easier (and
> > feasible) to just leave them running, perhaps that was easier.
>
> I don't have a clear picture of how the block layer operates. For
> example, what is the reason for this comment in the definition of
> struct genhd?
>
> struct device *driverfs_dev; // FIXME: remove
>
> Isn't that crucial for making a disk show up in sysfs? Is the comment
> out of date?

Don't ask me, I'd suggest using git blame for finding out who wrote that
and ping them.

> A possible approach is to add suspend and resume methods for this
> driverfs_dev, and make them be responsible for stopping and restarting
> the writeback task instead of relying on the freezer. Then
> del_gendisk() could cleanly restart the task when necessary.

That sounds over engineered to me.

> > > In principle, a device's writeback task could be unfrozen immediately
> > > after the device is resumed. In practice this might not solve the
> > > problem, since the del_gendisk() call occurs _within_ the device's
> > > resume routine. I suppose del_gendisk() could be made responsible for
> > > unfreezing the writeback task.
> >
> > And that's back to the question of whether or not that is a nice thing to
> > do. It seems a bit dirty, but otoh where else to do it. Perhaps just
> > using the kblockd to postpone the del_gendisk() to out-of-resume context
> > would be the best approach.
>
> That would involve a layering violation, wouldn't it? Either the
> driver would have to interface with kblockd directly, or else
> del_gendisk() would need to know whether the writeback task was frozen.
>
> On the whole, I think it's best for the block layer to retain full
> control over its own tasks and requirements.

You would export such functionality - del_gendisk_deferred(), or
something like that. The kblockd suggestion was implementation detail,
not something the driver would concern itself with. It's not exactly
picture perfect, but it could be used from eg resume context where the
device isn't fully live yet.

--
Jens Axboe

Jens Axboe

unread,

Feb 23, 2010, 5:20:02 PM2/23/10

to

On Tue, Feb 23 2010, Alan Stern wrote:

> On Tue, 23 Feb 2010, Jens Axboe wrote:
>
> > > P.S.: Jens, given a pointer to a struct gendisk or to a struct
> > > request_queue, is there a good way to tell whether there any dirty
> > > buffers for that device waiting to be written out? This is for
> > > purposes of runtime power management -- in the initial implementation,
> > > I want to avoid powering-down a block device if it is open or has any
> > > dirty buffers. In other words, only completely idle devices should be
> > > powered down (a good example would be a card reader with no memory card
> > > inserted).
> >
> > There's no fool proof way. For most file systems I think you could get
> > away with checking the q->bdi dirty lists to see if there's anything
> > pending. But that wont work always, if the fs uses a different backing
> > dev info than then queue itself.
>
> That's not what I meant. Dirty buffers on a filesystem make no
> difference because they always get written out when the filesystem is
> unmounted. The device file remains open as long as the filesystem
> is mounted, which would prevent the device from being powered down.
>
> I was asking about dirty buffers on a block device that isn't holding a
> filesystem -- where the raw device is being used directly for I/O.

OK, so just specifically the page cache of the device. Is that really
enough of an issue to warrant special checking? I mean, what normal
setup would even use buffer raw device access?

But if you wanted, I guess the only way would be to lookup
dirty/writeback pages on the bdev inode mapping. For that you'd need the
bdev, not the gendisk or the queue though.

--
Jens Axboe

Alan Stern

unread,

Feb 24, 2010, 11:00:02 AM2/24/10

to

On Tue, 23 Feb 2010, Jens Axboe wrote:

> > That's not what I meant. Dirty buffers on a filesystem make no
> > difference because they always get written out when the filesystem is
> > unmounted. The device file remains open as long as the filesystem
> > is mounted, which would prevent the device from being powered down.
> >
> > I was asking about dirty buffers on a block device that isn't holding a
> > filesystem -- where the raw device is being used directly for I/O.
>
> OK, so just specifically the page cache of the device. Is that really
> enough of an issue to warrant special checking? I mean, what normal
> setup would even use buffer raw device access?

Doesn't fdisk use it? There might be other applications too.

> But if you wanted, I guess the only way would be to lookup
> dirty/writeback pages on the bdev inode mapping. For that you'd need the
> bdev, not the gendisk or the queue though.

I can get the bdev from the gendisk by calling bdget_disk() with a
partition number of 0, right? What would the next step be? Would this
check for dirty pages associated with any of the partitions or would it
only look at pages associated with the inode for the entire disk?

Alan Stern

unread,

Feb 24, 2010, 11:00:02 AM2/24/10

to

Hmm. There's still no way for the driver to know whether or not the
writeback task is frozen when it wants to call del_gendisk(). It
would have to defer _all_ such calls. And all hot-pluggable block
drivers would have to do this -- would that be acceptable?

How about plugging the request queue instead of freezing the writeback
task? Would that work? It should be easy enough for a driver to
unplug the queue before unregistering its device from within a resume
method.

Alan Stern

Jens Axboe

unread,

Feb 24, 2010, 2:10:02 PM2/24/10

to

On Wed, Feb 24 2010, Alan Stern wrote:
> On Tue, 23 Feb 2010, Jens Axboe wrote:
>
> > > That's not what I meant. Dirty buffers on a filesystem make no
> > > difference because they always get written out when the filesystem is
> > > unmounted. The device file remains open as long as the filesystem
> > > is mounted, which would prevent the device from being powered down.
> > >
> > > I was asking about dirty buffers on a block device that isn't holding a
> > > filesystem -- where the raw device is being used directly for I/O.
> >
> > OK, so just specifically the page cache of the device. Is that really
> > enough of an issue to warrant special checking? I mean, what normal
> > setup would even use buffer raw device access?
>
> Doesn't fdisk use it? There might be other applications too.

It does, but that sound be a very short lived issue (since the dirty
buffers will get flushed).

> > But if you wanted, I guess the only way would be to lookup
> > dirty/writeback pages on the bdev inode mapping. For that you'd need the
> > bdev, not the gendisk or the queue though.
>
> I can get the bdev from the gendisk by calling bdget_disk() with a
> partition number of 0, right? What would the next step be? Would this
> check for dirty pages associated with any of the partitions or would it
> only look at pages associated with the inode for the entire disk?

It would cover the entire bdev.

--
Jens Axboe

Jens Axboe

unread,

Feb 24, 2010, 2:20:02 PM2/24/10

to

On Wed, Feb 24 2010, Alan Stern wrote:

> On Tue, 23 Feb 2010, Jens Axboe wrote:
>
> > > > And that's back to the question of whether or not that is a nice thing to
> > > > do. It seems a bit dirty, but otoh where else to do it. Perhaps just
> > > > using the kblockd to postpone the del_gendisk() to out-of-resume context
> > > > would be the best approach.
> > >
> > > That would involve a layering violation, wouldn't it? Either the
> > > driver would have to interface with kblockd directly, or else
> > > del_gendisk() would need to know whether the writeback task was frozen.
> > >
> > > On the whole, I think it's best for the block layer to retain full
> > > control over its own tasks and requirements.
> >
> > You would export such functionality - del_gendisk_deferred(), or
> > something like that. The kblockd suggestion was implementation detail,
> > not something the driver would concern itself with. It's not exactly
> > picture perfect, but it could be used from eg resume context where the
> > device isn't fully live yet.
>
> Hmm. There's still no way for the driver to know whether or not the
> writeback task is frozen when it wants to call del_gendisk(). It
> would have to defer _all_ such calls. And all hot-pluggable block
> drivers would have to do this -- would that be acceptable?

I was assuming it knew it was being called from a critical location,
like from resume. I guess the callback just iterates the bus devices and
calls the device remove, so that doesn't quite work without other
changes.

> How about plugging the request queue instead of freezing the writeback
> task? Would that work? It should be easy enough for a driver to
> unplug the queue before unregistering its device from within a resume
> method.

We have specific methods for either freezing of stopping or starting the
queue, perhaps those would be appropriate for suspend/resume actions. It
effectively prevents the queueing function from being called. If there
are dirty pages for the device, then it would not help though, as you
would still get stuck waiting for that IO to complete.

--
Jens Axboe

Alan Stern

unread,

Feb 24, 2010, 3:10:01 PM2/24/10

to

On Wed, 24 Feb 2010, Jens Axboe wrote:

> > > But if you wanted, I guess the only way would be to lookup
> > > dirty/writeback pages on the bdev inode mapping. For that you'd need the
> > > bdev, not the gendisk or the queue though.
> >
> > I can get the bdev from the gendisk by calling bdget_disk() with a
> > partition number of 0, right? What would the next step be? Would this
> > check for dirty pages associated with any of the partitions or would it
> > only look at pages associated with the inode for the entire disk?
>
> It would cover the entire bdev.

Okay, so once I've got the bdev, how do I look up the dirty/writeback
pages on the inode mapping?

Alan Stern

unread,

Feb 24, 2010, 3:20:01 PM2/24/10

to

On Wed, 24 Feb 2010, Jens Axboe wrote:

> > How about plugging the request queue instead of freezing the writeback
> > task? Would that work? It should be easy enough for a driver to
> > unplug the queue before unregistering its device from within a resume
> > method.
>
> We have specific methods for either freezing of stopping or starting the
> queue, perhaps those would be appropriate for suspend/resume actions. It
> effectively prevents the queueing function from being called. If there
> are dirty pages for the device, then it would not help though, as you
> would still get stuck waiting for that IO to complete.

If the resume method would restart the queue before unregistering the
device, pending dirty pages wouldn't cause any problems. They'd get
sent down to the driver and rejected immediately because the device was
dead or done.

The difficulty with this approach is that it requires individual
attention for each block device driver. Either the driver has to
freeze/stop/plug the queue during suspend (and restart it during
resume) or else the device's writeback task has to be frozen.

Can this be encapsulated by a function in the block layer? For
example, drivers could call blk_set_hot_unpluggable(bdev) for devices
that might need to be unregistered during resume. Then they would
become responsible for managing the device's queue.

Alan Stern

Jens Axboe

unread,

Feb 25, 2010, 3:30:01 AM2/25/10

to

On Wed, Feb 24 2010, Alan Stern wrote:

> On Wed, 24 Feb 2010, Jens Axboe wrote:
>
> > > > But if you wanted, I guess the only way would be to lookup
> > > > dirty/writeback pages on the bdev inode mapping. For that you'd need the
> > > > bdev, not the gendisk or the queue though.
> > >
> > > I can get the bdev from the gendisk by calling bdget_disk() with a
> > > partition number of 0, right? What would the next step be? Would this
> > > check for dirty pages associated with any of the partitions or would it
> > > only look at pages associated with the inode for the entire disk?
> >
> > It would cover the entire bdev.
>
> Okay, so once I've got the bdev, how do I look up the dirty/writeback
> pages on the inode mapping?

I _think_ you can get away with not doing a radix lookup for dirty
pages, just looking at the BDI_RECLAIMABLE stat on the bdi. That would
be:

bdi_stat(bdev->bd_inode->i_mapping->backing_dev_info, BDI_RECLAIMABLE);

--
Jens Axboe

Dave Chinner

unread,

Feb 25, 2010, 5:20:02 PM2/25/10

to

On Thu, Feb 25, 2010 at 09:20:35AM +0100, Jens Axboe wrote:
> On Wed, Feb 24 2010, Alan Stern wrote:
> > On Wed, 24 Feb 2010, Jens Axboe wrote:
> >
> > > > > But if you wanted, I guess the only way would be to lookup
> > > > > dirty/writeback pages on the bdev inode mapping. For that you'd need the
> > > > > bdev, not the gendisk or the queue though.
> > > >
> > > > I can get the bdev from the gendisk by calling bdget_disk() with a
> > > > partition number of 0, right? What would the next step be? Would this
> > > > check for dirty pages associated with any of the partitions or would it
> > > > only look at pages associated with the inode for the entire disk?
> > >
> > > It would cover the entire bdev.
> >
> > Okay, so once I've got the bdev, how do I look up the dirty/writeback
> > pages on the inode mapping?
>
> I _think_ you can get away with not doing a radix lookup for dirty
> pages, just looking at the BDI_RECLAIMABLE stat on the bdi. That would
> be:
>
> bdi_stat(bdev->bd_inode->i_mapping->backing_dev_info, BDI_RECLAIMABLE);

mapping_tagged(bdev->bd_inode->i_mapping, PAGECACHE_TAG_DIRTY);

is about as low overhead as it gets as the radix tree propagateѕ
tags back up to the root. i.e. no page lookups needed at all to
determine if it is dirty.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com

Pavel Machek

unread,

Mar 1, 2010, 1:40:02 AM3/1/10

to

Hi!

> > > This is a matter for Jens. Is the bdi writeback task freezable? If it
> > > is, should it be made unfreezable?
> >
> > I'm not a big expect on what tasks should be freezable or not. As it
> > stands, the writeback tasks will attempt to freeze and thaw with the
> > system. I guess that screws the sync from resume call, since it's not
> > running and the sync will wait for it to retrieve and finish that work
> > item.
> >
> > To the suspend experts - can we safely mark the writeback tasks as
> > non-freezable?
>
> The reason for freezing those tasks is to avoid writebacks at random
> times during a system sleep transition, when the underlying device may
> already be suspended, right?

It is also there to avoid inconsistency between in-filesystem data and
snapshot in hibernation image.

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Alan Stern

unread,

Mar 1, 2010, 10:30:01 AM3/1/10

to

On Mon, 1 Mar 2010, Pavel Machek wrote:

> Hi!
>
> > > > This is a matter for Jens. Is the bdi writeback task freezable? If it
> > > > is, should it be made unfreezable?
> > >
> > > I'm not a big expect on what tasks should be freezable or not. As it
> > > stands, the writeback tasks will attempt to freeze and thaw with the
> > > system. I guess that screws the sync from resume call, since it's not
> > > running and the sync will wait for it to retrieve and finish that work
> > > item.
> > >
> > > To the suspend experts - can we safely mark the writeback tasks as
> > > non-freezable?
> >
> > The reason for freezing those tasks is to avoid writebacks at random
> > times during a system sleep transition, when the underlying device may
> > already be suspended, right?
>
> It is also there to avoid inconsistency between in-filesystem data and
> snapshot in hibernation image.

A good point, although in this case I think it won't matter. Writing
out a dirty page twice (once right after taking the snapshot and then
again after resuming from hibernation) will leave the disk in a correct
state.

Alan Stern

Pavel Machek

unread,

Mar 3, 2010, 5:00:03 PM3/3/10

to

Hi!

> > > The reason for freezing those tasks is to avoid writebacks at random
> > > times during a system sleep transition, when the underlying device may
> > > already be suspended, right?
> >
> > It is also there to avoid inconsistency between in-filesystem data and
> > snapshot in hibernation image.
>
> A good point, although in this case I think it won't matter. Writing
> out a dirty page twice (once right after taking the snapshot and then
> again after resuming from hibernation) will leave the disk in a correct
> state.

No, I don't think so. Have you considered all the various journalling
systems?

Definitely not in presence of I/O errors. Commit block can only be
written after previous blocks are successfully writen to the journal.

So lets see:

Write previous block, write commit block, write more blocks

Error writing previous block (block now contains garbage), leading to
kernel panic

journalling assumptions broken: commit block is there, but previous
blocks are not intact. Data loss.

...and that was the first I could think about. Lets not do
this. Barriers were invented for a reason.
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Alan Stern

unread,

Mar 3, 2010, 5:30:03 PM3/3/10

to

Very well. Then we still need a solution to the original problem:
Devices sometimes need to be unregistered during resume, but
del_gendisk() blocks on the writeback thread, which is frozen until
after the resume finishes. How do you suggest this be fixed?

Alan Stern

Rafael J. Wysocki

unread,

Mar 3, 2010, 7:30:01 PM3/3/10

to

I thought about thawing the writeback thread earlier in such cases.

Would that makes sense / is it doable at all?

Rafael

Alan Stern

unread,

Mar 3, 2010, 9:50:01 PM3/3/10

to

On Thu, 4 Mar 2010, Rafael J. Wysocki wrote:

> > Very well. Then we still need a solution to the original problem:
> > Devices sometimes need to be unregistered during resume, but
> > del_gendisk() blocks on the writeback thread, which is frozen until
> > after the resume finishes. How do you suggest this be fixed?
>
> I thought about thawing the writeback thread earlier in such cases.
>
> Would that makes sense / is it doable at all?

My thought exactly. This is the only approach that also solves the
following race:

A driver is unloaded at the same time as a suspend starts.

The writeback thread gets frozen.

Then before the rmmod thread is frozen, it calls del_gendisk.

Delaying things by means of a workqueue (or the equivalent) might also
work, but it doesn't seem as safe. For example, some important
writebacks might end up getting delayed until too late.

Alan Stern

Pavel Machek

unread,

Mar 4, 2010, 9:00:02 AM3/4/10

to

Hi!

> > journalling assumptions broken: commit block is there, but previous
> > blocks are not intact. Data loss.
> >
> > ...and that was the first I could think about. Lets not do
> > this. Barriers were invented for a reason.
>
> Very well. Then we still need a solution to the original problem:
> Devices sometimes need to be unregistered during resume, but
> del_gendisk() blocks on the writeback thread, which is frozen until
> after the resume finishes. How do you suggest this be fixed?

Avoid unregistering device during resume. Instead, return errors until
resume is done and you can call del_gendisk?
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Rafael J. Wysocki

unread,

Mar 4, 2010, 2:30:02 PM3/4/10

to

On Thursday 04 March 2010, Alan Stern wrote:
> On Thu, 4 Mar 2010, Rafael J. Wysocki wrote:
>
> > > Very well. Then we still need a solution to the original problem:
> > > Devices sometimes need to be unregistered during resume, but
> > > del_gendisk() blocks on the writeback thread, which is frozen until
> > > after the resume finishes. How do you suggest this be fixed?
> >
> > I thought about thawing the writeback thread earlier in such cases.
> >
> > Would that makes sense / is it doable at all?
>
> My thought exactly. This is the only approach that also solves the
> following race:
>
> A driver is unloaded at the same time as a suspend starts.
>
> The writeback thread gets frozen.
>
> Then before the rmmod thread is frozen, it calls del_gendisk.
>
> Delaying things by means of a workqueue (or the equivalent) might also
> work, but it doesn't seem as safe. For example, some important
> writebacks might end up getting delayed until too late.

OK, so what exactly should trigger the thawing of the writeback thread?

Should we do that unconditionally at one point or wait for a specific situation
to happen, and how to recognize that situation in the latter case?

Rafael

Alan Stern

unread,

Mar 4, 2010, 2:40:02 PM3/4/10

to

On Thu, 4 Mar 2010, Rafael J. Wysocki wrote:

> On Thursday 04 March 2010, Alan Stern wrote:
> > On Thu, 4 Mar 2010, Rafael J. Wysocki wrote:
> >
> > > > Very well. Then we still need a solution to the original problem:
> > > > Devices sometimes need to be unregistered during resume, but
> > > > del_gendisk() blocks on the writeback thread, which is frozen until
> > > > after the resume finishes. How do you suggest this be fixed?
> > >
> > > I thought about thawing the writeback thread earlier in such cases.
> > >
> > > Would that makes sense / is it doable at all?
> >
> > My thought exactly. This is the only approach that also solves the
> > following race:
> >
> > A driver is unloaded at the same time as a suspend starts.
> >
> > The writeback thread gets frozen.
> >
> > Then before the rmmod thread is frozen, it calls del_gendisk.
> >
> > Delaying things by means of a workqueue (or the equivalent) might also
> > work, but it doesn't seem as safe. For example, some important
> > writebacks might end up getting delayed until too late.
>
> OK, so what exactly should trigger the thawing of the writeback thread?
>
> Should we do that unconditionally at one point or wait for a specific situation
> to happen, and how to recognize that situation in the latter case?

My thought was that del_gendisk would do this unconditionally. It
would make the task unfreezable and then thaw it, to prevent a race.

This assumes that del_gendisk never gets called in the middle of a
sleep transition unless the device is really gone. If that's not the
case, Pavel's suggestion (make the resume routine do the device removal
asynchronously) might be better.

Alan Stern

Rafael J. Wysocki

unread,

Mar 4, 2010, 3:10:02 PM3/4/10

to

Well, I guess we can make it a rule.

Or use a flag that will be set to 'true' during resume (early) and will tell
del_gendisk whether to thaw the writeback thread.

Rafael

Pavel Machek

unread,

Mar 4, 2010, 3:20:01 PM3/4/10

to

Hi!

> > My thought exactly. This is the only approach that also solves the
> > following race:
> >
> > A driver is unloaded at the same time as a suspend starts.
> >
> > The writeback thread gets frozen.
> >
> > Then before the rmmod thread is frozen, it calls del_gendisk.
> >
> > Delaying things by means of a workqueue (or the equivalent) might also
> > work, but it doesn't seem as safe. For example, some important
> > writebacks might end up getting delayed until too late.

Delaying writebacks during sleep should be ok... That's why we do
sync() after userspace is frozen -- nothing really important should be
waiting for writeback after that point.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Matt Reimer

unread,

Apr 22, 2010, 7:50:02 PM4/22/10

to

On Thu, Mar 4, 2010 at 1:15 PM, Pavel Machek <pa...@ucw.cz> wrote:
> Hi!
>
>> > My thought exactly. This is the only approach that also solves the
>> > following race:
>> >
>> > A driver is unloaded at the same time as a suspend starts.
>> >
>> > The writeback thread gets frozen.
>> >
>> > Then before the rmmod thread is frozen, it calls del_gendisk.
>> >
>> > Delaying things by means of a workqueue (or the equivalent) might also
>> > work, but it doesn't seem as safe. For example, some important
>> > writebacks might end up getting delayed until too late.
>
> Delaying writebacks during sleep should be ok... That's why we do
> sync() after userspace is frozen -- nothing really important should be
> waiting for writeback after that point.

Has this been fixed, or has a consensus about how to fix this been
achieved? I'm hitting the same problem and have some time to work on a
fix.

Matt

Rafael J. Wysocki

unread,

Apr 23, 2010, 1:20:01 AM4/23/10

to

On Friday 23 April 2010, Matt Reimer wrote:
> On Thu, Mar 4, 2010 at 1:15 PM, Pavel Machek <pa...@ucw.cz> wrote:
> > Hi!
> >
> >> > My thought exactly. This is the only approach that also solves the
> >> > following race:
> >> >
> >> > A driver is unloaded at the same time as a suspend starts.
> >> >
> >> > The writeback thread gets frozen.
> >> >
> >> > Then before the rmmod thread is frozen, it calls del_gendisk.
> >> >
> >> > Delaying things by means of a workqueue (or the equivalent) might also
> >> > work, but it doesn't seem as safe. For example, some important
> >> > writebacks might end up getting delayed until too late.
> >
> > Delaying writebacks during sleep should be ok... That's why we do
> > sync() after userspace is frozen -- nothing really important should be
> > waiting for writeback after that point.
>
> Has this been fixed,

No, it hasn't.

> or has a consensus about how to fix this been
> achieved? I'm hitting the same problem and have some time to work on a
> fix.

Generally, it looks like del_gendisk should thaw writeback threads, but not
during suspend, only during resume.

Rafael

Matt Reimer

unread,

May 11, 2010, 8:00:01 PM5/11/10

to

On Thu, Apr 22, 2010 at 10:17 PM, Rafael J. Wysocki <r...@sisk.pl> wrote:
> On Friday 23 April 2010, Matt Reimer wrote:
>> On Thu, Mar 4, 2010 at 1:15 PM, Pavel Machek <pa...@ucw.cz> wrote:
>> > Hi!
>> >
>> >> > My thought exactly. �This is the only approach that also solves the
>> >> > following race:
>> >> >
>> >> > � � A driver is unloaded at the same time as a suspend starts.
>> >> >
>> >> > � � The writeback thread gets frozen.
>> >> >
>> >> > � � Then before the rmmod thread is frozen, it calls del_gendisk.
>> >> >
>> >> > Delaying things by means of a workqueue (or the equivalent) might also
>> >> > work, but it doesn't seem as safe. �For example, some important
>> >> > writebacks might end up getting delayed until too late.
>> >
>> > Delaying writebacks during sleep should be ok... That's why we do
>> > sync() after userspace is frozen -- nothing really important should be
>> > waiting for writeback after that point.
>>
>> Has this been fixed,
>
> No, it hasn't.
>
>> or has a consensus about how to fix this been
>> achieved? I'm hitting the same problem and have some time to work on a
>> fix.
>
> Generally, it looks like del_gendisk should thaw writeback threads, but not
> during suspend, only during resume.

Thawing the writeback thread only during resume does fix the case
Maxim originally presented:

0. build kernel with CONFIG_MMC_UNSAFE_RESUME
1. insert SD card
2. suspend
3. remove SD card while suspended
4. resume from suspend hangs

But if CONFIG_MMC_UNSAFE_RESUME is not set, the kernel oopses during
suspend because the MMC device suspend times out:

mmc0: card e624 removed
**** DPM device timeout: pxa2xx-mci.0 (pxa2xx-mci)
kernel BUG at /home/mreimer/sdg/android/android-2.1/kernel/drivers/base/power/main.c:453!
Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = c0004000
[00000000] *pgd=00000000
Internal error: Oops: 817 [#1] PREEMPT

If I thaw the writeback thread unconditionally in del_gendisk() then
suspend and resume work as expected for both CONFIG_MMC_UNSAFE_RESUME
set/not set, even when the card is removed while suspended.

So what is the proper fix?

Matt

Alan Stern

unread,

May 12, 2010, 11:00:02 AM5/12/10

to

I don't see any reason not to let del_gendisk thaw the writeback thread
during suspend. Since the device is going away anyhow, letting the
thread run shouldn't cause any problems.

Alan Stern

Matt Reimer

unread,

May 13, 2010, 5:50:01 PM5/13/10

to

So how does the attached patch look?

Matt

From 20d8340471eb05aa54af1349f4ddccecd9c230c6 Mon Sep 17 00:00:00 2001
From: Matt Reimer <mre...@sdgsystems.com>
Date: Thu, 13 May 2010 14:36:54 -0700
Subject: [PATCH] fs: prevent hang on suspend/resume when MMC/SD card present

Devices can come and go from the MMC/SD bus during suspend or resume,
when the writeback thread is frozen, resulting in a hang. So thaw the
writeback thread in del_gendisk() to prevent the hang.

Signed-off-by: Matt Reimer <mre...@sdgsystems.com>
---
fs/partitions/check.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/partitions/check.c b/fs/partitions/check.c
index e238ab2..b303919 100644
--- a/fs/partitions/check.c
+++ b/fs/partitions/check.c
@@ -666,6 +666,8 @@ void del_gendisk(struct gendisk *disk)
struct disk_part_iter piter;
struct hd_struct *part;

+ thaw_process(disk->queue->backing_dev_info.wb.task);
+
/* invalidate stuff */
disk_part_iter_init(&piter, disk,
DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
--
1.7.0.4

0001-fs-prevent-hang-on-suspend-resume-when-MMC-SD-card-p.patch

Alan Stern

unread,

May 13, 2010, 6:00:04 PM5/13/10

to

On Thu, 13 May 2010, Matt Reimer wrote:

> So how does the attached patch look?
>
> Matt
>
>
> From 20d8340471eb05aa54af1349f4ddccecd9c230c6 Mon Sep 17 00:00:00 2001
> From: Matt Reimer <mre...@sdgsystems.com>
> Date: Thu, 13 May 2010 14:36:54 -0700
> Subject: [PATCH] fs: prevent hang on suspend/resume when MMC/SD card present
>
> Devices can come and go from the MMC/SD bus during suspend or resume,
> when the writeback thread is frozen, resulting in a hang. So thaw the
> writeback thread in del_gendisk() to prevent the hang.

I don't see anything wrong with the patch itself, but I dislike the
description. Devices can come and go from any hotpluggable bus, not
just MMC/SD. That just happens to be the first place the problem was
observed.

Matt Reimer

unread,

May 13, 2010, 6:30:02 PM5/13/10

to

On Thu, May 13, 2010 at 2:54 PM, Alan Stern <st...@rowland.harvard.edu> wrote:
> On Thu, 13 May 2010, Matt Reimer wrote:
>
>> So how does the attached patch look?
>>
>> Matt
>>
>>
>> From 20d8340471eb05aa54af1349f4ddccecd9c230c6 Mon Sep 17 00:00:00 2001
>> From: Matt Reimer <mre...@sdgsystems.com>
>> Date: Thu, 13 May 2010 14:36:54 -0700
>> Subject: [PATCH] fs: prevent hang on suspend/resume when MMC/SD card present
>>
>> Devices can come and go from the MMC/SD bus during suspend or resume,
>> when the writeback thread is frozen, resulting in a hang. So thaw the
>> writeback thread in del_gendisk() to prevent the hang.
>
> I don't see anything wrong with the patch itself, but I dislike the
> description. Devices can come and go from any hotpluggable bus, not
> just MMC/SD. That just happens to be the first place the problem was
> observed.

Good point. How about this?

Matt

From 813bd223e5a2fa577b9e64ddf12654a93d0aab8b Mon Sep 17 00:00:00 2001

From: Matt Reimer <mre...@sdgsystems.com>
Date: Thu, 13 May 2010 14:36:54 -0700
Subject: [PATCH] fs: prevent hang on suspend/resume when MMC/SD card present

Devices can come and go bus during suspend or resume, when the
writeback thread is frozen, resulting in a hang. Prevent the hang
by thawing the writeback thread in del_gendisk().

0001-fs-prevent-hang-on-suspend-resume-when-MMC-SD-card-p.patch

Nigel Cunningham

unread,

May 13, 2010, 6:50:02 PM5/13/10

to

Hi.

Why not just make it unfreezeable to start with?

Regards,

Nigel

Alan Stern

unread,

May 14, 2010, 10:40:01 PM5/14/10

to

On Thu, 13 May 2010, Matt Reimer wrote:

> > I don't see anything wrong with the patch itself, but I dislike the
> > description. �Devices can come and go from any hotpluggable bus, not
> > just MMC/SD. �That just happens to be the first place the problem was
> > observed.
>
> Good point. How about this?
>
> Matt
>
> From 813bd223e5a2fa577b9e64ddf12654a93d0aab8b Mon Sep 17 00:00:00 2001
> From: Matt Reimer <mre...@sdgsystems.com>
> Date: Thu, 13 May 2010 14:36:54 -0700
> Subject: [PATCH] fs: prevent hang on suspend/resume when MMC/SD card present
>
> Devices can come and go bus during suspend or resume, when the
> writeback thread is frozen, resulting in a hang. Prevent the hang
> by thawing the writeback thread in del_gendisk().

I would have said "the block layer's writeback thread", but this is
okay.

Alan Stern

Message has been deleted

Nigel Cunningham

unread,

May 14, 2010, 11:00:02 PM5/14/10

to

Hi.

On 15/05/10 12:37, Alan Stern wrote:
> On Fri, 14 May 2010, Nigel Cunningham wrote:
>
>> Hi.

>
>>> Subject: [PATCH] fs: prevent hang on suspend/resume when MMC/SD card present
>>>
>>> Devices can come and go bus during suspend or resume, when the
>>> writeback thread is frozen, resulting in a hang. Prevent the hang
>>> by thawing the writeback thread in del_gendisk().
>

>> Why not just make it unfreezeable to start with?
>

> If the writeback thread were unfreezable, it might wake up and try to
> write dirty pages back to disks after they were already suspended.
> That would not lead to good consequences...

If it syncs data as it should when we freeze processes, there won't be
any problem. Perhaps this is just an argument against making syncing
optional?

Regards,

Nigel

Rafael J. Wysocki

unread,

May 15, 2010, 7:50:01 PM5/15/10

to

On Saturday 15 May 2010, Alan Stern wrote:
> On Thu, 13 May 2010, Matt Reimer wrote:
>
> > > I don't see anything wrong with the patch itself, but I dislike the
> > > description. Devices can come and go from any hotpluggable bus, not
> > > just MMC/SD. That just happens to be the first place the problem was
> > > observed.
> >
> > Good point. How about this?
> >
> > Matt
> >
> > From 813bd223e5a2fa577b9e64ddf12654a93d0aab8b Mon Sep 17 00:00:00 2001
> > From: Matt Reimer <mre...@sdgsystems.com>
> > Date: Thu, 13 May 2010 14:36:54 -0700
> > Subject: [PATCH] fs: prevent hang on suspend/resume when MMC/SD card present
> >
> > Devices can come and go bus during suspend or resume, when the
> > writeback thread is frozen, resulting in a hang. Prevent the hang
> > by thawing the writeback thread in del_gendisk().
>
> I would have said "the block layer's writeback thread", but this is
> okay.

OK, so now I have a question who's going to take the patch.

Jens, what's your opinion?

Rafael

Nigel Cunningham

unread,

May 16, 2010, 3:50:01 AM5/16/10

to

Hi.

On 16/05/10 06:30, Rafael J. Wysocki wrote:
> On Saturday 15 May 2010, Alan Stern wrote:
>> On Thu, 13 May 2010, Matt Reimer wrote:
>>
>>>> I don't see anything wrong with the patch itself, but I dislike the
>>>> description. Devices can come and go from any hotpluggable bus, not
>>>> just MMC/SD. That just happens to be the first place the problem was
>>>> observed.
>>>
>>> Good point. How about this?
>>>
>>> Matt
>>>
>>> From 813bd223e5a2fa577b9e64ddf12654a93d0aab8b Mon Sep 17 00:00:00 2001
>>> From: Matt Reimer<mre...@sdgsystems.com>
>>> Date: Thu, 13 May 2010 14:36:54 -0700
>>> Subject: [PATCH] fs: prevent hang on suspend/resume when MMC/SD card present
>>>
>>> Devices can come and go bus during suspend or resume, when the
>>> writeback thread is frozen, resulting in a hang. Prevent the hang
>>> by thawing the writeback thread in del_gendisk().
>>
>> I would have said "the block layer's writeback thread", but this is
>> okay.
>
> OK, so now I have a question who's going to take the patch.

I object to the patch.

Tell the patch it ought to exit once thawed, by all means.

Make the patch unfreezeable to begin with, by all means.

But don't go down the path of having $random_code_path unfreeze a
thread. That will lead to unpredictability, confusion and bugs.

If you know a disk is going to be unregistered during resume, use the
hooks early in the suspend / hibernate process to block new I/O and
flush what's already there so that there's no need to block on the
writeback thread, and/or no need to have the writeback thread frozen.

Regards,

Nigel

Rafael J. Wysocki

unread,

May 16, 2010, 3:40:01 PM5/16/10

to

On Sunday 16 May 2010, Nigel Cunningham wrote:
> Hi.
>
> On 16/05/10 06:30, Rafael J. Wysocki wrote:
> > On Saturday 15 May 2010, Alan Stern wrote:
> >> On Thu, 13 May 2010, Matt Reimer wrote:
> >>
> >>>> I don't see anything wrong with the patch itself, but I dislike the
> >>>> description. Devices can come and go from any hotpluggable bus, not
> >>>> just MMC/SD. That just happens to be the first place the problem was
> >>>> observed.
> >>>
> >>> Good point. How about this?
> >>>
> >>> Matt
> >>>
> >>> From 813bd223e5a2fa577b9e64ddf12654a93d0aab8b Mon Sep 17 00:00:00 2001
> >>> From: Matt Reimer<mre...@sdgsystems.com>
> >>> Date: Thu, 13 May 2010 14:36:54 -0700
> >>> Subject: [PATCH] fs: prevent hang on suspend/resume when MMC/SD card present
> >>>
> >>> Devices can come and go bus during suspend or resume, when the
> >>> writeback thread is frozen, resulting in a hang. Prevent the hang
> >>> by thawing the writeback thread in del_gendisk().
> >>
> >> I would have said "the block layer's writeback thread", but this is
> >> okay.
> >
> > OK, so now I have a question who's going to take the patch.
>
> I object to the patch.
>
> Tell the patch it ought to exit once thawed, by all means.

I'm not sure what you mean. Care to explain?

> Make the patch unfreezeable to begin with, by all means.

That wouldn't work.

> But don't go down the path of having $random_code_path unfreeze a
> thread. That will lead to unpredictability, confusion and bugs.

As a general rule, I agree, but this particular case is somewhat special.

> If you know a disk is going to be unregistered during resume,

How do we check that, exactly?

> use the hooks early in the suspend / hibernate process to block new I/O and
> flush what's already there so that there's no need to block on the
> writeback thread, and/or no need to have the writeback thread frozen.

I'm not sure if that's realistic. Do you have a specific implementation in
mind?

Rafael

Rafael J. Wysocki

unread,

May 16, 2010, 3:40:01 PM5/16/10

to

On Saturday 15 May 2010, Nigel Cunningham wrote:
> Hi.
>
> On 15/05/10 12:37, Alan Stern wrote:
> > On Fri, 14 May 2010, Nigel Cunningham wrote:
> >
> >> Hi.
> >
> >>> Subject: [PATCH] fs: prevent hang on suspend/resume when MMC/SD card present
> >>>
> >>> Devices can come and go bus during suspend or resume, when the
> >>> writeback thread is frozen, resulting in a hang. Prevent the hang
> >>> by thawing the writeback thread in del_gendisk().
> >
> >> Why not just make it unfreezeable to start with?
> >
> > If the writeback thread were unfreezable, it might wake up and try to
> > write dirty pages back to disks after they were already suspended.
> > That would not lead to good consequences...
>
> If it syncs data as it should when we freeze processes, there won't be
> any problem. Perhaps this is just an argument against making syncing
> optional?

No, there is a problem. The writeback threads were made freezable after some
people had reported hangs during suspend that had been tracked down to that
issue. IIRC.

Rafael

Nigel Cunningham

unread,

May 16, 2010, 5:40:03 PM5/16/10

to

Hi.

I mean "Set up some sort of flag that it can look at once thawed at
resume time, and use that to tell it to exit at that point."

>> Make the patch unfreezeable to begin with, by all means.
>
> That wouldn't work.

Why not?

>> But don't go down the path of having $random_code_path unfreeze a
>> thread. That will lead to unpredictability, confusion and bugs.
>
> As a general rule, I agree, but this particular case is somewhat special.
>
>> If you know a disk is going to be unregistered during resume,
>
> How do we check that, exactly?

Well, if you can figure out that you need to go down this path at this
point in the process, you must be able to apply the same logic to come
to the same conclusion earlier in the process.

>> use the hooks early in the suspend / hibernate process to block new I/O and
>> flush what's already there so that there's no need to block on the
>> writeback thread, and/or no need to have the writeback thread frozen.
>
> I'm not sure if that's realistic. Do you have a specific implementation in
> mind?

No - just the conviction that there must be a way in which the logic
that says we need to call del_gendisk can be applied earlier to clean
things up at an earlier stage.

I'd love to look at it further, but I'm moving house on Wednesday, so
have a little bit of a pressing need to do some packing at the moment :)
That said, next week, life is going to be very different - I could look
at it then.

Regards,

Nigel

Alan Stern

unread,

May 16, 2010, 10:30:02 PM5/16/10

to

On Mon, 17 May 2010, Nigel Cunningham wrote:

> >> I object to the patch.
> >>
> >> Tell the patch it ought to exit once thawed, by all means.
> >
> > I'm not sure what you mean. Care to explain?
>
> I mean "Set up some sort of flag that it can look at once thawed at
> resume time, and use that to tell it to exit at that point."

Doesn't the patch do exactly that? The "flag" is set by virtue of the
fact that this is part of del_gendisk -- which means the disk is being
unregistered and hence the writeback thread will exit shortly.

> >> Make the patch unfreezeable to begin with, by all means.
> >
> > That wouldn't work.
>
> Why not?

It would be nice to know exactly why. Perhaps the underlying problem
can be fixed.

> >> If you know a disk is going to be unregistered during resume,
> >
> > How do we check that, exactly?
>
> Well, if you can figure out that you need to go down this path at this
> point in the process, you must be able to apply the same logic to come
> to the same conclusion earlier in the process.

That's not true. You don't know that a device is going to be unplugged
until it actually _is_ unplugged.

Alan Stern

Nigel Cunningham

unread,

May 17, 2010, 3:50:02 AM5/17/10

to

Hi.

On 17/05/10 12:22, Alan Stern wrote:
> On Mon, 17 May 2010, Nigel Cunningham wrote:
>
>>>> I object to the patch.
>>>>
>>>> Tell the patch it ought to exit once thawed, by all means.
>>>
>>> I'm not sure what you mean. Care to explain?
>>
>> I mean "Set up some sort of flag that it can look at once thawed at
>> resume time, and use that to tell it to exit at that point."
>
> Doesn't the patch do exactly that? The "flag" is set by virtue of the
> fact that this is part of del_gendisk -- which means the disk is being
> unregistered and hence the writeback thread will exit shortly.
>
>>>> Make the patch unfreezeable to begin with, by all means.
>>>
>>> That wouldn't work.
>>
>> Why not?
>
> It would be nice to know exactly why. Perhaps the underlying problem
> can be fixed.
>
>>>> If you know a disk is going to be unregistered during resume,
>>>
>>> How do we check that, exactly?
>>
>> Well, if you can figure out that you need to go down this path at this
>> point in the process, you must be able to apply the same logic to come
>> to the same conclusion earlier in the process.
>
> That's not true. You don't know that a device is going to be unplugged
> until it actually _is_ unplugged.

Sorry - I got unregistered during suspend (instead of resume) in my
head. That said, I'd argue that we should be...

1) Syncing all the data at the start of the suspend/hibernate, so
there's nothing for the workthread to do if we do del_gendisk.
2) Telling things to exit if we do find the device is gone away at
resume time, but not relying on the going-away happening until post
process thaw, for a couple of reasons:
- Potential for races/confusion/mess etc in having $random process
thawing other processes. Only the thread doing the suspend/hibernate
should be freezing/thawing.
- We're dealing with the symptom, not the cause. Almost always a bad idea.

Regards,

Nigel

Rafael J. Wysocki

unread,

May 17, 2010, 4:40:02 PM5/17/10

to

I don't see a problem here, as far as kernel threads are concerned. In this
particular case this is a subsystem thawing a thread that belongs to it. No
problem.

> - We're dealing with the symptom, not the cause. Almost always a bad idea.

I very much prefer to have a fix for a symptom than no fix at all, which is the
realistic alternative in this case.

So, I think we should merge the patch and if someone finds the root cause
at one point in future, then we can just use the *right* approach instead of
the present one.

The problem is real and people in the field are affected by it, so if you don't
have a working alternative patch, please just let go.

Thanks,
Rafael

Nigel Cunningham

unread,

May 17, 2010, 7:00:01 PM5/17/10

to

Hi.

I'm not denying that the problem is real. What I am concerned about is
finding a real solution, not just putting a sticky plaster over the
wound. It seems to me to be much wiser to deal with the issue properly
now instead of doing extra work later to diagnose what might be a harder
to reproduce symptom of the same problem. I'd happily put the time in
now myself, but I simply don't have the time this week.

Would it be possible to apply the patch, adding some sort of new tag
that can be used to say "This needs further attention", perhaps
including an enduring reference to this conversation. Later, the 'real'
fix could include another special tag that says "Proper fix for the
symptom addressed in commit 5e94f810"?

Regards,

Nigel

Rafael J. Wysocki

unread,

May 18, 2010, 3:50:02 PM5/18/10

to

Yeah, /* FIXME: */ is for that. With some comment why we're doing this. :-)

> Later, the 'real' fix could include another special tag that says "Proper fix for the
> symptom addressed in commit 5e94f810"?

Thanks,
Rafael

Alan Stern

unread,

May 18, 2010, 4:10:02 PM5/18/10

to

On Tue, 18 May 2010, Rafael J. Wysocki wrote:

> > > So, I think we should merge the patch and if someone finds the root cause
> > > at one point in future, then we can just use the *right* approach instead of
> > > the present one.
> > >
> > > The problem is real and people in the field are affected by it, so if you don't
> > > have a working alternative patch, please just let go.
> >
> > I'm not denying that the problem is real. What I am concerned about is
> > finding a real solution, not just putting a sticky plaster over the
> > wound. It seems to me to be much wiser to deal with the issue properly
> > now instead of doing extra work later to diagnose what might be a harder
> > to reproduce symptom of the same problem. I'd happily put the time in
> > now myself, but I simply don't have the time this week.
> >
> > Would it be possible to apply the patch, adding some sort of new tag
> > that can be used to say "This needs further attention", perhaps
> > including an enduring reference to this conversation.

Isn't there a problem involving filesystems with userspace components
(fuse and such)?

You can't sync these filesystems after userspace is frozen. And if you
sync everything first, there's a possibility that some more pages will
get marked dirty before all the threads are frozen.

The only safe course seems to be to freeze the writeback thread. But
Nigel is right; there is a race here. What happens if del_gendisk is
called just as the freezer is starting up? Clearly, del_gendisk needs
to mark the writeback thread as unfreezable in addition to thawing it.
And there has to be some synchronization -- a spinlock for example.

Alan Stern

Pavel Machek

unread,

May 24, 2010, 3:10:04 PM5/24/10

to

> >>- We're dealing with the symptom, not the cause. Almost always a bad idea.
> >
> >I very much prefer to have a fix for a symptom than no fix at all, which is the
> >realistic alternative in this case.
> >
> >So, I think we should merge the patch and if someone finds the root cause
> >at one point in future, then we can just use the *right* approach instead of
> >the present one.
> >
> >The problem is real and people in the field are affected by it, so if you don't
> >have a working alternative patch, please just let go.
>
> I'm not denying that the problem is real. What I am concerned about
> is finding a real solution, not just putting a sticky plaster over
> the wound. It seems to me to be much wiser to deal with the issue
> properly now instead of doing extra work later to diagnose what
> might be a harder to reproduce symptom of the same problem. I'd
> happily put the time in now myself, but I simply don't have the time
> this week.
>
> Would it be possible to apply the patch, adding some sort of new tag
> that can be used to say "This needs further attention", perhaps
> including an enduring reference to this conversation. Later, the
> 'real' fix could include another special tag that says "Proper fix
> for the symptom addressed in commit 5e94f810"?

WARN_ON() whenever patch triggers?

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Nigel Cunningham

unread,

May 24, 2010, 5:30:03 PM5/24/10

to

Hi.

On 25/05/10 05:02, Pavel Machek wrote:
>
>>>> - We're dealing with the symptom, not the cause. Almost always a bad idea.
>>>
>>> I very much prefer to have a fix for a symptom than no fix at all, which is the
>>> realistic alternative in this case.
>>>
>>> So, I think we should merge the patch and if someone finds the root cause
>>> at one point in future, then we can just use the *right* approach instead of
>>> the present one.
>>>
>>> The problem is real and people in the field are affected by it, so if you don't
>>> have a working alternative patch, please just let go.
>>
>> I'm not denying that the problem is real. What I am concerned about
>> is finding a real solution, not just putting a sticky plaster over
>> the wound. It seems to me to be much wiser to deal with the issue
>> properly now instead of doing extra work later to diagnose what
>> might be a harder to reproduce symptom of the same problem. I'd
>> happily put the time in now myself, but I simply don't have the time
>> this week.
>>
>> Would it be possible to apply the patch, adding some sort of new tag
>> that can be used to say "This needs further attention", perhaps
>> including an enduring reference to this conversation. Later, the
>> 'real' fix could include another special tag that says "Proper fix
>> for the symptom addressed in commit 5e94f810"?
>
> WARN_ON() whenever patch triggers?

I suppose that would do. I was thinking of a more generic git tag that
could perhaps be searched for later, but .. okay.

Nigel

Maxim Levitsky

unread,

Jun 4, 2010, 7:30:02 AM6/4/10

to

On Thu, 2010-03-04 at 14:53 +0100, Pavel Machek wrote:
> Hi!
>
> > > journalling assumptions broken: commit block is there, but previous
> > > blocks are not intact. Data loss.
> > >
> > > ...and that was the first I could think about. Lets not do
> > > this. Barriers were invented for a reason.
> >
> > Very well. Then we still need a solution to the original problem:
> > Devices sometimes need to be unregistered during resume, but
> > del_gendisk() blocks on the writeback thread, which is frozen until
> > after the resume finishes. How do you suggest this be fixed?
>
> Avoid unregistering device during resume. Instead, return errors until
> resume is done and you can call del_gendisk?

This won't help ether. The same driver needs to unregister perfectly
working device on suspend, because the user might replace the card
during suspend and fool the os.
There is a setting, CONFIG_MMC_UNSAFE_RESUME and I use it, but it isn't
default.

Anyway to revive that old thread, how about introducing new
del_gendisk_no_sync?

A less safe version of del_gendisk, but which won't sync the filesystem.
Since driver knows that card is gone, there is no point of syncing it.

(the sync is done by invalidate_partition, so some flag should be
propagated to it).

Best regards,
Maxim Levitsky

Alan Stern

unread,

Jun 4, 2010, 11:00:02 AM6/4/10

to

On Fri, 4 Jun 2010, Maxim Levitsky wrote:

> On Thu, 2010-03-04 at 14:53 +0100, Pavel Machek wrote:
> > Hi!
> >
> > > > journalling assumptions broken: commit block is there, but previous
> > > > blocks are not intact. Data loss.
> > > >
> > > > ...and that was the first I could think about. Lets not do
> > > > this. Barriers were invented for a reason.
> > >
> > > Very well. Then we still need a solution to the original problem:
> > > Devices sometimes need to be unregistered during resume, but
> > > del_gendisk() blocks on the writeback thread, which is frozen until
> > > after the resume finishes. How do you suggest this be fixed?
> >
> > Avoid unregistering device during resume. Instead, return errors until
> > resume is done and you can call del_gendisk?
>
> This won't help ether. The same driver needs to unregister perfectly
> working device on suspend, because the user might replace the card
> during suspend and fool the os.
> There is a setting, CONFIG_MMC_UNSAFE_RESUME and I use it, but it isn't
> default.

People have generally agreed that the best answer is to have
del_gendisk always thaw the writeback thread.

> Anyway to revive that old thread, how about introducing new
> del_gendisk_no_sync?
>
> A less safe version of del_gendisk, but which won't sync the filesystem.
> Since driver knows that card is gone, there is no point of syncing it.
>
> (the sync is done by invalidate_partition, so some flag should be
> propagated to it).

That might work for mmc, but it wouldn't help other drivers subject to
the same problem.

Besides, it's subject to races. What if the card _isn't_ gone, but for
some other reason the driver wants to unregister the device at a time
when the writeback thread is frozen?

Alan Stern

Maxim Levitsky

unread,

Jun 4, 2010, 11:30:02 AM6/4/10

to

On Fri, 2010-06-04 at 10:59 -0400, Alan Stern wrote:
> On Fri, 4 Jun 2010, Maxim Levitsky wrote:
>
> > On Thu, 2010-03-04 at 14:53 +0100, Pavel Machek wrote:
> > > Hi!
> > >
> > > > > journalling assumptions broken: commit block is there, but previous
> > > > > blocks are not intact. Data loss.
> > > > >
> > > > > ...and that was the first I could think about. Lets not do
> > > > > this. Barriers were invented for a reason.
> > > >
> > > > Very well. Then we still need a solution to the original problem:
> > > > Devices sometimes need to be unregistered during resume, but
> > > > del_gendisk() blocks on the writeback thread, which is frozen until
> > > > after the resume finishes. How do you suggest this be fixed?
> > >
> > > Avoid unregistering device during resume. Instead, return errors until
> > > resume is done and you can call del_gendisk?
> >
> > This won't help ether. The same driver needs to unregister perfectly
> > working device on suspend, because the user might replace the card
> > during suspend and fool the os.
> > There is a setting, CONFIG_MMC_UNSAFE_RESUME and I use it, but it isn't
> > default.
>
> People have generally agreed that the best answer is to have
> del_gendisk always thaw the writeback thread.

Now the question is how to do that? :-)

Best regards,
Maxim Levitsky

Alan Stern

unread,

Jun 4, 2010, 2:00:01 PM6/4/10

to

On Fri, 4 Jun 2010, Maxim Levitsky wrote:

> On Fri, 2010-06-04 at 10:59 -0400, Alan Stern wrote:
> > On Fri, 4 Jun 2010, Maxim Levitsky wrote:
> >
> > > On Thu, 2010-03-04 at 14:53 +0100, Pavel Machek wrote:
> > > > Hi!
> > > >
> > > > > > journalling assumptions broken: commit block is there, but previous
> > > > > > blocks are not intact. Data loss.
> > > > > >
> > > > > > ...and that was the first I could think about. Lets not do
> > > > > > this. Barriers were invented for a reason.
> > > > >
> > > > > Very well. Then we still need a solution to the original problem:
> > > > > Devices sometimes need to be unregistered during resume, but
> > > > > del_gendisk() blocks on the writeback thread, which is frozen until
> > > > > after the resume finishes. How do you suggest this be fixed?
> > > >
> > > > Avoid unregistering device during resume. Instead, return errors until
> > > > resume is done and you can call del_gendisk?
> > >
> > > This won't help ether. The same driver needs to unregister perfectly
> > > working device on suspend, because the user might replace the card
> > > during suspend and fool the os.
> > > There is a setting, CONFIG_MMC_UNSAFE_RESUME and I use it, but it isn't
> > > default.
> >
> > People have generally agreed that the best answer is to have
> > del_gendisk always thaw the writeback thread.
> Now the question is how to do that? :-)

Here's a start:

http://marc.info/?l=linux-kernel&m=127378922620074&w=2

It's not quite right, because it needs to make the writeback thread
unfreezable before thawing it.