Please pull power management updates for 2.6.33 from:
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6.git for-linus
They include:
* Asynchronous suspend and resume infrastructure. For now, PCI, ACPI and
serio devices are enabled to suspend and resume asynchronously.
* Fixes for the runtime PM framework.
* Hibernate cleanups from Nigel and Jiri Slaby.
* Freezer optimisation from Tejun.
Documentation/power/runtime_pm.txt | 12 +-
drivers/acpi/glue.c | 3 +
drivers/acpi/scan.c | 1 +
drivers/base/core.c | 4 +
drivers/base/power/Makefile | 2 +-
drivers/base/power/common.c | 283 +++++++++++++++
drivers/base/power/main.c | 677 +++++++++++++++++++++++++++++++++---
drivers/base/power/power.h | 42 ++-
drivers/base/power/runtime.c | 27 +-
drivers/base/power/sysfs.c | 47 +++
drivers/input/serio/serio.c | 1 +
drivers/pci/pci.c | 1 +
drivers/pci/pcie/portdrv_core.c | 1 +
include/linux/device.h | 11 +
include/linux/pm.h | 21 +-
include/linux/pm_link.h | 30 ++
include/linux/pm_runtime.h | 12 +
include/linux/resume-trace.h | 7 +
kernel/power/Kconfig | 14 +
kernel/power/Makefile | 2 +-
kernel/power/hibernate.c | 26 ++
kernel/power/main.c | 32 ++-
kernel/power/process.c | 14 +-
kernel/power/swap.c | 107 ++++++-
kernel/power/swsusp.c | 188 ----------
25 files changed, 1281 insertions(+), 284 deletions(-)
---------------
Alan Stern (2):
PM / Runtime: Export the PM runtime workqueue
PM / Runtime: Use deferred_resume flag in pm_request_resume
Jaswinder Singh Rajput (1):
PM: Fix kernel-doc notation
Jiri Slaby (1):
PM / Hibernate: Swap, use KERN_CONT
Nigel Cunningham (2):
PM / Hibernate: Move swap functions to kernel/power/swap.c.
PM / Hibernate: Shift remaining code from swsusp.c to hibernate.c
Rafael J. Wysocki (15):
PM: Introduce PM links framework
PM: Asynchronous resume of devices
PM: Asynchronous suspend of devices
PM: Allow PCI devices to suspend/resume asynchronously
PM: Allow ACPI devices to suspend/resume asynchronously
PM: Add a switch for disabling/enabling asynchronous suspend/resume
PM: Measure device suspend and resume times
PM: Add facility for advanced testing of async suspend/resume
PM: Measure suspend and resume times for individual devices
PM: Allow serio input devices to suspend/resume asynchronously
PM / Runtime: Fix lockdep warning in __pm_runtime_set_status()
PM / Runtime: Ensure timer_expires is nonzero in pm_schedule_suspend()
PM / Runtime: Make documentation of runtime_idle() agree with the code
PM / Runtime: Remove unnecessary braces in __pm_runtime_set_status()
PM: Add flag for devices capable of generating run-time wake-up events
Stephen Rothwell (1):
PM / Suspend: Using TASK_ macros requires sched.h
Tejun Heo (1):
PM / freezer: Don't get over-anxious while waiting
On Sat, 5 Dec 2009, Rafael J. Wysocki wrote:
>
> * Asynchronous suspend and resume infrastructure. For now, PCI, ACPI and
> serio devices are enabled to suspend and resume asynchronously.
I really think this is totally and utterly broken. Both from an
implementation standpoint _and_ from a pure conceptual one.
Why isn't the suspend/resume async stuff just done like the init async
stuff?
We don't need that crazy per-device flag for initialization, neither do we
need drivers "enabling" any async code at all. They just do some things
asynchronously, and then at the end of init time we wait for all those
async events.
So why does suspend/resume need to do crazy sh*t instead?
It all looks terminally broken: you force async suspend for all PCI
drivers, even when it makes no sense. Rather than let the drivers that
already know how to do things like disk spinup asynchronously just do it
that way.
The "timing" routines are also just crazy. What is the excuse for
dpm_show_time() taking both start and stop times, since there is never any
valid situation when it shouldn't have that do_gettimeofday(&stop) just
before it? IOW - the whole end-time thing should be _inside_
dpm_show_time, rather than being done by the caller. No?
In other words - I'm not pulling this crazy thing. You'd better explain
why it was done that way, when we already have done the same things better
before in different ways.
Linus
On Sat, 5 Dec 2009, Linus Torvalds wrote:
>
> In other words - I'm not pulling this crazy thing. You'd better explain
> why it was done that way, when we already have done the same things better
> before in different ways.
I get the feeling that all the crazy infrastructure was due to worrying
about the suspend/resume topology.
But the reason we don't worry about that during init is that it doesn't
really tend to matter. Most slow operations are the things that aren't
topology-aware, ie things like spinning up/down disks etc, that really
could be done as a separate phase instead.
For example, is there really any reason why resume doesn't look exactly
like the init sequence? Drivers that do slow things can start async work
to do them, and then at the end of the resume sequence we just do a "wait
for all the async work", exactly like we do for the current init
sequences.
And yes, for the suspend sequence we obviously need to do any async work
(and wait for it) before we actually shut down the controllers, but that
would be _way_ more natural to do by just introducing a "pre-suspend" hook
that walks the device tree and does any async stuff. And then just wait
for the async stuff to finish before doing the suspend, and perhaps again
before doing late_suspend (maybe somebody wants to do async stuff at the
second stage too).
Then, because we need a way to undo things if things go wrong in the
middle (and because it's also nice to be symmetric), we'd probably want to
introduce that kind of "post_resume()" callback that allows you have a
separate async wakeup thing for resume time too.
What are actually the expensive/slow things during suspend/resume? Am I
wrong when I say it's things like disk spinup/spindown (and USB discovery,
which needs USB-level support anyway, since it can involve devices that we
didn't even know about before discovery started).
OK, I'll send another pull request without these patches if the rest of the
changes is fine with you (they are more important than the async stuff to me).
> I get the feeling that all the crazy infrastructure was due to worrying
> about the suspend/resume topology.
Yes, that's the main reason.
> But the reason we don't worry about that during init is that it doesn't
> really tend to matter. Most slow operations are the things that aren't
> topology-aware, ie things like spinning up/down disks etc, that really
> could be done as a separate phase instead.
It was based on the observation that in many cases the current drivers' suspend
and resume callbacks can be run in parallel with the other drivers' callbacks
without any changes to the drivers (and without introducing another phase of
suspend for that matter), because there are no dependencies between them.
The approach you're suggesting would require modifying individual drivers, which
I just wanted to avoid. If you don't like that, we'll have to take the longer
route, although I'm afraid that will take lots of time and we won't be able to
exploit the entire possible parallelism this way.
> For example, is there really any reason why resume doesn't look exactly
> like the init sequence? Drivers that do slow things can start async work
> to do them, and then at the end of the resume sequence we just do a "wait
> for all the async work", exactly like we do for the current init
> sequences.
During suspend we actually know what the dependences between the devices
are and we can use that information to do more things in parallel. For
instance, in the majority of cases (I've yet to find a counterexample), the
entire suspend callbacks of "leaf" PCI devices may be run in parallel with each
other.
So, the point is not to look for "async stuff" in a driver's suspend/resume
callbacks, but to execute the whole suspend/resume callbacks in parallel,
if possible.
> And yes, for the suspend sequence we obviously need to do any async work
> (and wait for it) before we actually shut down the controllers, but that
> would be _way_ more natural to do by just introducing a "pre-suspend" hook
> that walks the device tree and does any async stuff. And then just wait
> for the async stuff to finish before doing the suspend, and perhaps again
> before doing late_suspend (maybe somebody wants to do async stuff at the
> second stage too).
>
> Then, because we need a way to undo things if things go wrong in the
> middle (and because it's also nice to be symmetric), we'd probably want to
> introduce that kind of "post_resume()" callback that allows you have a
> separate async wakeup thing for resume time too.
Yes, we can do that, but I'm afraid that the majority of drivers won't use the
new hooks (people generally seem to be too reluctant to modify their
suspend/resume callbacks for fear of breaking things).
Also, for an individual driver it really is difficult to separate the "async
stuff" from the stuff which is not async, because everything that can be done
in parallel with the other drivers' suspend callbacks is potentially async, as
long as there are no dependences between the devices in question (like
parent-child dependences, or PCI-shadow ACPI dependences). And it's
generally worth doing that if a driver's suspend or resume callback calls
msleep() for whatever reason.
> What are actually the expensive/slow things during suspend/resume? Am I
> wrong when I say it's things like disk spinup/spindown (and USB discovery,
> which needs USB-level support anyway, since it can involve devices that we
> didn't even know about before discovery started).
Disk spinup/spindown takes time, but also some ACPI devices resume slowly,
serio devices do that too and there are surprisingly many drivers that wait
(using msleep() during suspend and resume). Apart from this, every PCI device
going from D0 to D3 during suspend and from D3 to D0 during resume requires
us to sleep for 10 ms (the sleeping is done by the PCI core, so the drivers
don't even realize it's there).
Thanks,
Rafael
Because it can run entire suspend and resume callbacks in parallel and not
just some stuff inside of them. The flag is to tell it which callbacks not to
execute in parallel, but it essentially should not be necessary once we
know all the dependences between devices (i.e. the ones that are not encoded in
the structure of the device tree).
The problem is that there are dependences between devices we're not aware of,
which are not documented anywhere and not reflected by the device tree
structure, and we need some time to figure them out.
> It all looks terminally broken: you force async suspend for all PCI
> drivers, even when it makes no sense.
I'm not exactly sure what you're referring to. The async suspend is not
forced, it just tells the PM core that it can execute PCI suspend/resume
callbacks in parallel as long as the devices in question don't depend on each
other.
> Rather than let the drivers that already know how to do things like disk
> spinup asynchronously just do it that way.
This isn't just about disk spin-up and things like that. If we can run entire
suspend/resume callbacks in parallel, why not do that?
> The "timing" routines are also just crazy. What is the excuse for
> dpm_show_time() taking both start and stop times,
This is a mistake, although really easily fixable in a followup patch.
> since there is never any valid situation when it shouldn't have that
> do_gettimeofday(&stop) just before it? IOW - the whole end-time thing should
> be _inside_ dpm_show_time, rather than being done by the caller. No?
Yes, you're right.
> In other words - I'm not pulling this crazy thing. You'd better explain
> why it was done that way, when we already have done the same things better
> before in different ways.
I'm not sure we have, but whatever.
As I said before, if the rest of the changes in my pull request are fine with
you, I'll just drop the async changes, although I'm not really convinced
they're so bad. They've been discussed a lot and they've been in linux-next
for a few months without any objection from anyone.
Thanks,
Rafael
>
> Disk spinup/spindown takes time, but also some ACPI devices resume
> slowly, serio devices do that too and there are surprisingly many
> drivers that wait (using msleep() during suspend and resume). Apart
> from this, every PCI device going from D0 to D3 during suspend and
> from D3 to D0 during resume requires us to sleep for 10 ms (the
> sleeping is done by the PCI core, so the drivers don't even realize
> it's there).
maybe a good step is to make a scripts/bootgraph.pl equivalent for
suspend/resume (or make a debug mode that outputs in a compatible format
so that the script can be used as is.. I don't mind either way, and
consider this my offer to help with such a script as long as there's
sufficient logging in dmesg ;-)
that way we can SEE which ones are an issue.... and by how much.
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Sun, 6 Dec 2009, Rafael J. Wysocki wrote:
>
> The approach you're suggesting would require modifying individual drivers which
> I just wanted to avoid.
In the init path, we had the reverse worry - not wanting to make
everything (where "everything" can be some subsystem like just the set of
PCI drivers, of course - not really "everything" in an absolute sense)
async, and then having to sort things out with the random driver that
couldn't handle it.
And there were _lots_ of drivers that couldn't handle it, because they
knew they got woken up serially. The ATA layer needed to know about
asynchronous things, because sometimes those independent devices aren't so
independent at all. Which is why I don't think your approach is safe.
Just to take an example of the whole "independent devices are not
necessarily independent" thing - things like multi-port PCMCIA controllers
generally show up as multiple PCI devices. But they are _not_ independent,
and they actually share some registers. Resuming them asynchronously might
well be ok, but maybe it's not. Who knows?
In contrast, a device driver can generally know that certain _parts_ of
the initialization is safe. As an example of that, I think the libata
layer does all the port enumeration synchronously, but then once the ports
have been identified, it does the rest async.
That's the kind of decision we can sanely make when we do the async part
as a "drivers may choose to do certain parts asynchronously". Doing it at
a higher level sounds like a problem to me.
> If you don't like that, we'll have to take the longer route, although
> I'm afraid that will take lots of time and we won't be able to exploit
> the entire possible parallelism this way.
Sure. But I'd rather do the safe thing. Especially since there are likely
just a few cases that really take a long time.
> During suspend we actually know what the dependences between the devices
> are and we can use that information to do more things in parallel. For
> instance, in the majority of cases (I've yet to find a counterexample), the
> entire suspend callbacks of "leaf" PCI devices may be run in parallel with each
> other.
See above. That's simply not at all guaranteed to be true.
And when it isn't true (ie different PCI leaf devices end up having subtle
dependencies), now you need to start doing hacky things.
I'd much rather have the individual drivers say "I can do this part in
parallel", and not force it on them. Because it is definitely _not_
guaranteed that PCI devices can do parallel resume and suspend.
> Yes, we can do that, but I'm afraid that the majority of drivers won't use the
> new hooks (people generally seem to be too reluctant to modify their
> suspend/resume callbacks for fear of breaking things).
See above - I don't think this is a "majority" issue. I think it's a
"let's figure out the problem spots, and fix _those_". IOW, get 2% of the
coverage, and get 95% of the advantage.
> Disk spinup/spindown takes time, but also some ACPI devices resume slowly,
We actually saw that when we did async init. And it was horrible. There's
nothing that says that the ACPI stuff necessarily even _can_ run in
parallel.
I think we currently only do the ACPI battery ops asynchronously.
Linus
On Sun, 6 Dec 2009, Rafael J. Wysocki wrote:
>
> > It all looks terminally broken: you force async suspend for all PCI
> > drivers, even when it makes no sense.
>
> I'm not exactly sure what you're referring to. The async suspend is not
> forced, it just tells the PM core that it can execute PCI suspend/resume
> callbacks in parallel as long as the devices in question don't depend on each
> other.
That's exactly what I mean by forcing async suspend/resume.
You don't know the ordering rules for PCI devices. Multi-function PCI
devices commonly share registers - they're on the same chip, after all.
And even when the _hardware_ is totally independent, we often have
discovery rules and want to initialize in order because different drivers
will do things like unregister entirely on suspend, and then re-register
on resume.
Imagine the mess when two ethernet devices randomly end up coming up with
different names (eth0/eth1) depending on subtle timing issues.
THAT is why we do things in order. Asynchronous programming is _hard_.
Just deciding that "all PCI devices can always be resumed and suspended
asynchronously" is a much MUCH bigger decision than you seem to have
even realized.
Linus
That's true at the moment, but in principle we can abstract all dependences
between devices as PM links that will enforce specific suspend/resume ordering
between them.
> Multi-function PCI devices commonly share registers - they're on the same
> chip, after all. And even when the _hardware_ is totally independent, we
> often have discovery rules and want to initialize in order because different
> drivers will do things like unregister entirely on suspend, and then
> re-register on resume.
Do any of the PCI drivers do that?
> Imagine the mess when two ethernet devices randomly end up coming up with
> different names (eth0/eth1) depending on subtle timing issues.
>
> THAT is why we do things in order. Asynchronous programming is _hard_.
> Just deciding that "all PCI devices can always be resumed and suspended
> asynchronously" is a much MUCH bigger decision than you seem to have
> even realized.
I have considered that, but at the end of the day I haven't seen a single
problem with that showing up in testing during the last two or three months.
Given the time the patchset spent in linux-next I'd expect someone to report
a problem with it - if there's a problem. But no one has said a word, so I'm
not that worried, although I'm still a bit cautious.
That's why there is a switch for disabling the feature altogether. It is
enabled by default, which perhaps is not the right setting, but I don't really
see a reason not to turn it on where it doesn't break things (like on
all of my test boxes at the moment).
Still, as I said before, the other changes in my pull request are more
important to me than the async patchset, so please let me know if they are fine
with you.
Thanks,
Rafael
OK, so what kind of logging is needed?
> that way we can SEE which ones are an issue.... and by how much.
Well, why not.
Thanks,
Rafael
On Sun, 6 Dec 2009, Rafael J. Wysocki wrote:
>
> > Multi-function PCI devices commonly share registers - they're on the same
> > chip, after all. And even when the _hardware_ is totally independent, we
> > often have discovery rules and want to initialize in order because different
> > drivers will do things like unregister entirely on suspend, and then
> > re-register on resume.
>
> Do any of the PCI drivers do that?
It used to be common at least for ethernet - there were a number of
drivers that essentially did the same thing on suspend/resume and on
module unload/reload.
The point is, I don't know. And neither do you. It's much safer to just do
drivers one by one, and not touch drivers that people don't test.
Linus
I mean, doing all of these sleeps sequentially is avoidable.
Thanks,
Rafael
> On Sunday 06 December 2009, Arjan van de Ven wrote:
> > On Sun, 6 Dec 2009 00:55:36 +0100
> > "Rafael J. Wysocki" <r...@sisk.pl> wrote:
> >
> > >
> > > Disk spinup/spindown takes time, but also some ACPI devices resume
> > > slowly, serio devices do that too and there are surprisingly many
> > > drivers that wait (using msleep() during suspend and resume).
> > > Apart from this, every PCI device going from D0 to D3 during
> > > suspend and from D3 to D0 during resume requires us to sleep for
> > > 10 ms (the sleeping is done by the PCI core, so the drivers don't
> > > even realize its there).
> >
> > maybe a good step is to make a scripts/bootgraph.pl equivalent for
> > suspend/resume (or make a debug mode that outputs in a compatible
> > format so that the script can be used as is.. I don't mind either
> > way, and consider this my offer to help with such a script as long
> > as there's sufficient logging in dmesg ;-)
>
> OK, so what kind of logging is needed?
basically the equivalent of the two initcall_debug paths in
init/main.c:do_one_initcall()
which prints a start time and an end time (and a pid) for each init
function; if we have the same for suspend calls (and resume)... we can
make the tool graph it.
Would be nice to get markers for start and end of the whole suspend
sequence as well as for the resume sequence; those make it easier to
know when the end is (so that the axis can be drawn etc)
shouldn't be too hard to implement...
While the current settings are probably unsafe (like enabling PCI devices
to be suspended asynchronously by default if there are not any direct
dependences between them), there are provisions to make everything safe, if
we have enough information (which also is needed to put the required logic into
the drivers). The device tree represents a good deal of the dependences
between devices and the other dependences may be represented as PM links
enforcing specific ordering of the PM callbacks.
> Just to take an example of the whole "independent devices are not
> necessarily independent" thing - things like multi-port PCMCIA controllers
> generally show up as multiple PCI devices. But they are _not_ independent,
> and they actually share some registers. Resuming them asynchronously might
> well be ok, but maybe it's not. Who knows?
I'd say if there's a worry that the same register may be accessed concurrently
from two different code paths, there should be some locking in place.
> In contrast, a device driver can generally know that certain _parts_ of
> the initialization is safe. As an example of that, I think the libata
> layer does all the port enumeration synchronously, but then once the ports
> have been identified, it does the rest async.
>
> That's the kind of decision we can sanely make when we do the async part
> as a "drivers may choose to do certain parts asynchronously". Doing it at
> a higher level sounds like a problem to me.
The difference between suspend and initialization is that during suspend we
have already enumerated all devices and we should know how they depend on
each other (and we really should know that if we are to actually understand how
things work), so we can represent that information somehow and use it to do
things at the higher level.
How to represent it is a different matter, but in principle it should be
possible.
> > If you don't like that, we'll have to take the longer route, although
> > I'm afraid that will take lots of time and we won't be able to exploit
> > the entire possible parallelism this way.
>
> Sure. But I'd rather do the safe thing. Especially since there are likely
> just a few cases that really take a long time.
And there are lots of small sleeps here and there that accumulate and are
entirely avoidable.
> > During suspend we actually know what the dependences between the devices
> > are and we can use that information to do more things in parallel. For
> > instance, in the majority of cases (I've yet to find a counterexample), the
> > entire suspend callbacks of "leaf" PCI devices may be run in parallel with each
> > other.
>
> See above. That's simply not at all guaranteed to be true.
>
> And when it isn't true (ie different PCI leaf devices end up having subtle
> dependencies), now you need to start doing hacky things.
>
> I'd much rather have the individual drivers say "I can do this part in
> parallel", and not force it on them. Because it is definitely _not_
> guaranteed that PCI devices can do parallel resume and suspend.
OK, it's not guaranteed, but why not do this on systems where it's known
to work?
> > Yes, we can do that, but I'm afraid that the majority of drivers won't use the
> new hooks (people generally seem to be too reluctant to modify their
> suspend/resume callbacks for fear of breaking things).
>
> See above - I don't think this is a "majority" issue. I think it's a
> "let's figure out the problem spots, and fix _those_". IOW, get 2% of the
> coverage, and get 95% of the advantage.
I wouldn't really like to add even more suspend/resume callbacks for this
purpose, because we already have so many of them. And even if we do that,
I don't really expect drivers to start using them any time soon.
> > Disk spinup/spindown takes time, but also some ACPI devices resume slowly,
>
> We actually saw that when we did async init. And it was horrible. There's
> nothing that says that the ACPI stuff necessarily even _can_ run in
> parallel.
>
> I think we currently only do the ACPI battery ops asynchronously.
There are only a few ACPI devices that have real suspend/resume callbacks
and I haven't seen problems with these in practice.
Thanks,
Rafael
On Sun, 6 Dec 2009, Rafael J. Wysocki wrote:
>
> While the current settings are probably unsafe (like enabling PCI devices
> to be suspended asynchronously by default if there are not any direct
> dependences between them), there are provisions to make everything safe, if
> we have enough information (which also is needed to put the required logic into
> the drivers).
I disagree.
Think of a situation that we already handle pretty poorly: USB mass
storage devices over a suspend/resume.
> The device tree represents a good deal of the dependences
> between devices and the other dependences may be represented as PM links
> enforcing specific ordering of the PM callbacks.
The device tree means nothing at all, because it may need to be entirely
rebuilt at resume time.
Optimally, what we _should_ be doing (and aren't) for suspend/resume of
USB is to just tear down the whole topology and rebuild it and re-connect
the things like mass storage devices. IOW, there would be no device tree
to describe the topology, because we're finding it anew. And it's one of
the things we _would_ want to do asynchronously with other things.
We don't want to build up some irrelevant PM links and callbacks. We don't
want to have some completely made-up new infrastructure for something that
we _already_ want to handle totally differently for init time.
IOW, I argue very strongly against making up something PM-specific, when
there really doesn't seem to be much of an advantage. We're much better
off trying to share the init code than making up something new.
> I'd say if there's a worry that the same register may be accessed concurrently
> from two different code paths, there should be some locking in place.
Yeah. And I wish ACPI didn't exist at all. We don't know.
And we want to _limit_ our exposure to these things.
Linus
With that assumption we have no choice but to leave the async stuff to the
drivers, which generally I'm fine with, although I really don't expect to see
it done.
> Optimally, what we _should_ be doing (and aren't) for suspend/resume of
> USB is to just tear down the whole topology and rebuild it and re-connect
> the things like mass storage devices. IOW, there would be no device tree
> to describe the topology, because we're finding it anew. And it's one of
> the things we _would_ want to do asynchronously with other things.
I think you should tell that to the USB people, because they don't seem to
think this way.
[Side note, I do think that at least some information in the device tree will
remain valid over suspend/resume, but this is a different matter.]
> We don't want to build up some irrelevant PM links and callbacks. We don't
> want to have some completely made-up new infrastructure for something that
> we _already_ want to handle totally differently for init time.
>
> IOW, I argue very strongly against making up something PM-specific, when
> there really doesn't seem to be much of an advantage. We're much better
> off trying to share the init code than making up something new.
>
> > I'd say if there's a worry that the same register may be accessed concurrently
> > from two different code paths, there should be some locking in place.
>
> Yeah. And I wish ACPI didn't exist at all. We don't know.
>
> And we want to _limit_ our exposure to these things.
Don't worry, I'm not going to touch async suspend/resume again, unless
somebody makes me do it.
BTW, you seem to have some quite strong opinions about power management that
you only share when somebody sends you patches you don't like. I
guess it would be much more productive if we knew your thoughts about it in
advance, so I hope you won't mind being CCed on core PM patches posted to
linux-pm for discussion.
Thanks,
Rafael
I think an even better option would be to extend 'perf timechart' to be
suspend/resume aware: add a few tracepoint events and teach 'perf
timechart' to draw them. (We should be able to do perf timechart record
across suspend/resume cycles just fine.)
( Doing that would also improve the tracing facilities within
suspend/resume quite significantly. It wouldn't just be a
single-purpose thing for graphing; perf trace and perf stat would
work equally well. )
Thanks,
Ingo
> Think of a situation that we already handle pretty poorly: USB mass
> storage devices over a suspend/resume.
>
> > The device tree represents a good deal of the dependences
> > between devices and the other dependences may be represented as PM links
> > enforcing specific ordering of the PM callbacks.
>
> The device tree means nothing at all, because it may need to be entirely
> rebuilt at resume time.
Nonsense.
> Optimally, what we _should_ be doing (and aren't) for suspend/resume of
> USB is to just tear down the whole topology and rebuild it and re-connect
> the things like mass storage devices. IOW, there would be no device tree
> to describe the topology, because we're finding it anew. And it's one of
> the things we _would_ want to do asynchronously with other things.
That's ridiculous. Having gone to all the trouble of building a device
tree, one which is presumably still almost entirely correct, why go to
all the trouble of tearing it down only to rebuild it again? (Note:
I'm talking about resume-from-RAM here, not resume-from-hibernation.)
Instead what we do is verify that the devices we remember from before
the suspend are still there, and then asynchronously handle new devices
which have been plugged in during the meantime. Doing this involves
relatively little extra or new code; most of the routines are shared
with the runtime PM and device reset paths.
As for asynchronicity... At init time, USB device discovery truly is
asynchronous. It can happen long after you log in (especially if you
don't plug in the device until after you log in!). But at resume time
we are more highly constrained. User processes cannot be unfrozen
until all the devices have been resumed; otherwise they would encounter
errors when trying to do I/O to a suspended device. (With the runtime
PM framework this is much less of a problem, but plenty of drivers
don't support runtime PM yet.)
> We don't want to build up some irrelevant PM links and callbacks. We don't
> want to have some completely made-up new infrastructure for something that
> we _already_ want to handle totally differently for init time.
>
> IOW, I argue very strongly against making up something PM-specific, when
> there really doesn't seem to be much of an advantage. We're much better
> off trying to share the init code than making up something new.
If I understand correctly, what you're suggesting is impractical. You
would have each driver responsible for resuming the devices it
registers. If it registered some children synchronously (during the
parent's probe) then it would resume them synchronously (during the
parent's resume); if it registered them asynchronously then it would
resume them asynchronously. In essence, every single device_add() or
device_register() call would have to be paired with a resume call.
To make such significant changes in every driver would be prohibitively
difficult. What we need is a compromise which gives drivers control
over the resume process without making them responsible for actually
carrying it out.
So consider this suggestion: Let's define PM groups. Each device
belongs to a group, and each group (except group 0, the initial group)
has an owner device. By default a device is added to its parent's
group during registration, but the driver can request that it be
assigned to a different group, which must be owned by that parent.
During resume, each PM group would correspond to an async task. The
devices in each group would be resumed sequentially, in order of
registration, but asynchronously with respect to other groups. The
async thread to resume a group would be launched after the group's
owner device was resumed.
So for example, the sibling functions on a PCI card could all be
assigned to the same group, but different cards could belong to
different groups. Likewise for ATA and PCMCIA controllers. Extra
cross-group constraints could be added if needed, but there should be
relatively few of them.
This way drivers can decide which of their devices will be resumed in
sequence or concurrently, but they won't have to do any of the
necessary work.
Alan Stern
There should be nothing special or privileged at all about the device
tree that gets built at boot time.
Consider the scenario of the laptop user with a docking station.
Adding, removing, and rewriting vast swaths of the device tree across
suspend/resume and hibernate/thaw is very easy to do when you are
plugging a laptop into one or more docking stations.
> Instead what we do is verify that the devices we remember from before
> the suspend are still there, and then asynchronously handle new devices
> which have been plugged in during the meantime. Doing this involves
> relatively little extra or new code; most of the routines are shared
> with the runtime PM and device reset paths.
Devices can vanish across suspend to RAM just as easily as they can be
added.
> On Sun, 6 Dec 2009, Rafael J. Wysocki wrote:
> >
> > While the current settings are probably unsafe (like enabling PCI
> > devices to be suspended asynchronously by default if there are not
> > any direct dependences between them), there are provisions to make
> everything safe, if we have enough information (which also is needed
> > to put the required logic into the drivers).
>
> I disagree.
>
> Think of a situation that we already handle pretty poorly: USB mass
> storage devices over a suspend/resume.
>
> > The device tree represents a good deal of the dependences
> > between devices and the other dependences may be represented as PM
> > links enforcing specific ordering of the PM callbacks.
>
> The device tree means nothing at all, because it may need to be
> entirely rebuilt at resume time.
btw I instrumented both the suspend and resume, and made graphs out of
it for my laptop (modern laptop with Intel cpu/wifi/graphics of course).
http://www.fenrus.org/graphs/suspend.svg
http://www.fenrus.org/graphs/resume.svg
(also attached for convenience)
the resume clearly shows that all this talking about PCI stuff is
completely without practical merit.. it's the USB stuff where the time
is spent.
in suspend, there's a PCI device (:1b) that does take some time, which
is the audio controller. The bulk of the time is in the serio driver
though..
As an "interested bystander" to this thread.... sounds like Linus'
arguments have merit, and that solving the USB resume to go async in
some form will fix pretty much all we want solving...
[and that at least we need to do this stuff data/measurement driven,
and not just based on how we THINK things work]
On Sun, 6 Dec 2009, Arjan van de Ven wrote:
>
> in suspend, there's a PCI device (:1b) that does take some time, which
> is the audio controller. The bulk of the time is in the serio driver
> though..
That serio thing is disgusting. We had serious problems with the serial
driver timeouts for boot-time optimizations too, didn't we?
I assume that you don't even _use_ that serial port, do you? Or is it open
for serial console logging or something? If it isn't even open, we
shouldn't waste any time on the hardware.
Your graph seems to say that serio1 shutdown is roughly from 29.40 to
29.85, ie almost half a second. That's just bogus.
I don't see where it comes from, though. It looks like we have
 - pciserial_suspend_ports/serial_pnp_suspend ->
     serial8250_suspend_port ->
       uart_suspend_port ->
         (wait for tx_empty, but only for ASYNC_INITIALIZED, which
          shouldn't be true if it's closed, and should be limited to 30ms)
         uart_change_pm ->
           serial8250_pm
and none of them look like they should take anywhere close to half a
second. So I'm obviously missing something, and your chart didn't include
the sleep/wakeup pairs.
>
>
> On Sun, 6 Dec 2009, Arjan van de Ven wrote:
> >
> > in suspend, there's a PCI device (:1b) that does take some time,
> > which is the audio controller. The bulk of the time is in the serio
> > driver though..
>
> That serio thing is disgusting. We had serious problems with the
> serial driver timeouts for boot-time optimizations too, didn't we?
isn't serio the PS/2 stuff?
(serio was an issue during boot as well due to some interesting rcu
delays btw)
> and none of them look like they should take anywhere close to half a
> second. So I'm obviously missing something, and your chart didn't
> include the sleep/wakeup pairs.
what do you mean by this? what would you like to see?
(I have a separate graph for resume.. but the graphing program does not
show things that resume so quickly that the font to print their name
would be less than a pixel high; I can fix that)
[fwiw I care more about resume speed than suspend speed, but obviously
am happy if both get fixed... it's just that resume matters much more
to the user, just like getting out of the idle loop matters more than
getting into it]
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
> btw I instrumented both the suspend and resume, and made graphs out of
> it for my laptop (modern laptop with Intel cpu/wifi/graphics of course).
>
> http://www.fenrus.org/graphs/suspend.svg
> http://www.fenrus.org/graphs/resume.svg
>
> (also attached for convenience)
>
> the resume clearly shows that all this talking about PCI stuff is
> completely without practical merit.. it's the USB stuff where the time
> is spent.
Arjan, can you try testing the USB timings again with the patch below
(for vanilla 2.6.32)?
Fair warning: I just composed this and haven't tried it out myself.
Thanks,
Alan Stern
Index: 2.6.32/drivers/usb/core/driver.c
===================================================================
--- 2.6.32.orig/drivers/usb/core/driver.c
+++ 2.6.32/drivers/usb/core/driver.c
@@ -1313,8 +1313,9 @@ static int usb_resume_both(struct usb_de
* then we're stuck. */
status = usb_resume_device(udev, msg);
}
- } else if (udev->reset_resume)
+ } else {
status = usb_resume_device(udev, msg);
+ }
if (status == 0 && udev->actconfig) {
for (i = 0; i < udev->actconfig->desc.bNumInterfaces; i++) {
Index: 2.6.32/drivers/usb/core/hub.c
===================================================================
--- 2.6.32.orig/drivers/usb/core/hub.c
+++ 2.6.32/drivers/usb/core/hub.c
@@ -1674,7 +1674,7 @@ static int usb_configure_device_otg(stru
* (Includes HNP test device.)
*/
if (udev->bus->b_hnp_enable || udev->bus->is_b_host) {
- err = usb_port_suspend(udev, PMSG_SUSPEND);
+ err = usb_port_suspend(udev, PMSG_AUTO_SUSPEND);
if (err < 0)
dev_dbg(&udev->dev, "HNP fail, %d\n", err);
}
@@ -2060,6 +2060,7 @@ static int check_port_resume_type(struct
/*
* usb_port_suspend - suspend a usb device's upstream port
* @udev: device that's no longer in active use, not a root hub
+ * @msg: Power Management message describing this state transition
* Context: must be able to sleep; device not locked; pm locks held
*
* Suspends a USB device that isn't in active use, conserving power.
@@ -2107,7 +2108,7 @@ int usb_port_suspend(struct usb_device *
{
struct usb_hub *hub = hdev_to_hub(udev->parent);
int port1 = udev->portnum;
- int status;
+ int status = 0;
// dev_dbg(hub->intfdev, "suspend port %d\n", port1);
@@ -2128,6 +2129,13 @@ int usb_port_suspend(struct usb_device *
status);
}
+ /* For system sleep transitions we don't actually need to suspend
+ * the port. The device will suspend itself when the entire bus
+ * is suspended.
+ */
+ if (!(msg.event & (PM_EVENT_USER | PM_EVENT_REMOTE | PM_EVENT_AUTO)))
+ return status;
+
/* see 7.1.7.6 */
status = set_port_feature(hub->hdev, port1, USB_PORT_FEAT_SUSPEND);
if (status) {
@@ -2231,6 +2239,7 @@ static int finish_port_resume(struct usb
/*
* usb_port_resume - re-activate a suspended usb device's upstream port
* @udev: device to re-activate, not a root hub
+ * @msg: Power Management message describing this state transition
* Context: must be able to sleep; device not locked; pm locks held
*
* This will re-activate the suspended device, increasing power usage
On Sun, 6 Dec 2009, Arjan van de Ven wrote:
> > That serio thing is disgusting. We had serious problems with the
> > serial driver timeouts for boot-time optimizations too, didn't we?
>
> isn't serio the PS/2 stuff?
Oh, you're right, I just assumed it was regular serial. So it's the
keyboard and mouse.
Linus
> On Sun, 6 Dec 2009, Arjan van de Ven wrote:
>
> > btw I instrumented both the suspend and resume, and made graphs out
> > of it for my laptop (modern laptop with Intel cpu/wifi/graphics of
> > course).
> >
> > http://www.fenrus.org/graphs/suspend.svg
> > http://www.fenrus.org/graphs/resume.svg
> >
> > (also attached for convenience)
> >
> > the resume clearly shows that all this talking about PCI stuff is
> > completely without practical merit.. it's the USB stuff where the
> > time is spent.
>
> Arjan, can you try testing the USB timings again with the patch below
> (for vanilla 2.6.32)?
>
> Fair warning: I just composed this and haven't tried it out myself.
unfortunately it does not make a difference that I can notice in the
graphs.
http://www.fenrus.org/graphs/resume2.svg
the resume problem seems to be that we resume all the hubs sequentially,
much like we used to discover them sequentially during boot....
I do not know how much I'm asking for, but would it be sensible to do a
similar thing for hub resume as we did for boot? eg start resuming them
all at the same time, so that the mandatory delays of these hubs will
overlap?
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
> > Arjan, can you try testing the USB timings again with the patch below
> > (for vanilla 2.6.32)?
> >
> > Fair warning: I just composed this and haven't tried it out myself.
>
> unfortunately it does not make a difference that I can notice in the
> graphs.
>
> http://www.fenrus.org/graphs/resume2.svg
Disappointing...
> the resume problem seems to be that we resume all the hubs sequentially,
> much like we used to discover them sequentially during boot....
But the patch should have reduced the time required to resume each
non-root hub. So the fact that they go sequentially shouldn't matter
as much.
For root hubs the patch won't help. Their delays can't be reduced.
> I do not know how much I'm asking for, but would it be sensible to do a
> similar thing for hub resume as we did for boot? eg start resuming them
> all at the same time, so that the mandatory delays of these hubs will
> overlap?
For one thing, there shouldn't be any mandatory delays for non-root
hubs during resume-from-RAM (although this depends to some extent on
your system firmware -- and it probably helps to have USB-2.0 hubs
rather than USB-1.1).
More importantly, what you're asking is impossible given the way the PM
core is structured. The hub-resume routine can't return early because
then it wouldn't be possible to resume devices plugged into that hub.
(Ironically, your request is essentially what Rafael was trying to
accomplish in the patches that provoked this email conversation.)
Guess I'll just have to try out your timing log addition for myself and
see what's going on...
Alan Stern
having spent 30 minutes trying to grok this code, I think there may be
a trick in using the async function call infrastructure.
if each USB hub's resume (hub_resume()) would be done as an async
function call, that would start allowing the hub resumes to go async,
but this is not enough.
usb_resume_both() would also then need to be an async call itself, and
do its "resume the parent" recursion as an async function call, and then
it needs to do a synchronization before actually resuming the device
itself (provided it is not a hub or hub like device I suppose).
the latter synchronization guarantees that no device will be resumed
before its parent tree structure is resumed, while allowing independent
parts of the tree to be resumed in parallel.
The hard part in this is the locking.... that is getting non-trivial
once you have multiple asynchronous functions executing.
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
> having spent 30 minutes trying to grok this code, I think there may be
> a trick in using the async function call infrastructure.
>
> if each USB hub's resume (hub_resume()) would be done as an async
> function call, that would start allowing the hub resumes to go async,
> but this is not enough.
>
> usb_resume_both() would also then need to be an async call itself, and
> do its "resume the parent" recursion as an async function call, and then
> it needs to do a synchronization before actually resuming the device
> itself (provided it is not a hub or hub like device I suppose).
>
> the latter synchronization guarantees that no device will be resumed
> before its parent tree structure is resumed, while allowing independent
> parts of the tree to be resumed in parallel.
>
> The hard part in this is the locking.... that is getting non-trivial
> once you have multiple asynchronous functions executing.
That's the whole point of Rafael's async suspend/resume framework. He
has done the hard work already.
Alan Stern
> On Sun, 6 Dec 2009 11:58:44 -0800 (PST)
> Linus Torvalds <torv...@linux-foundation.org> wrote:
>
>>
>>
>> On Sun, 6 Dec 2009, Arjan van de Ven wrote:
>>>
>>> in suspend, there's a PCI device (:1b) that does take some time,
>>> which is the audio controller. The bulk of the time is in the serio
>>> driver though..
>>
>> That serio thing is disgusting. We had serious problems with the
>> serial driver timeouts for boot-time optimizations too, didn't we?
>
> isn't serio the PS/2 stuff?
Yes, that's your PS/2 mouse (or rather touchpad), and the delay comes from
device reset (needed by some keyboard controllers - I remember HP - or
the device and keyboard will be dead at resume).
--
Dmitry
> > isn't serio the PS/2 stuff?
>
> Yes, that's your PS/2 mouse (or rather touchpad), and the delay comes
> from device reset (needed by some keyboard controllers - I remember
> HP - or the device and keyboard will be dead at resume).
and I have an HP laptop... so this makes perfect sense.
Thanks for the explanation!
Now, the good news is that serio is near invisible on resume...
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
> Yes, that's your PS/2 mouse (rather touchpad) and the delay comes
> from device reset (needed by some keyboard controllers - I remember
> HP -or it and keyboard will be dead at resume).
>
btw could we do this reset in an async function call (as long as we
wait for it to complete before we finally pull the plug)?
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
Well, we do it for everyone; it's just a particular series of HPs that
forced us to add it.
> Now, the good news is that serio is near invisible on resume...
>
Resume is fully offloaded to kseriod.
--
Dmitry
It has to complete before we start shutting down i8042, so there are
dependencies involved...
--
Dmitry
Hi, Linus,
can you please look at this patch set and see if the idea is right?
http://marc.info/?l=linux-kernel&m=124840449826386&w=2
http://marc.info/?l=linux-acpi&m=124840456826456&w=2
http://marc.info/?l=linux-acpi&m=124840456926459&w=2
http://marc.info/?l=linux-acpi&m=124840457026468&w=2
http://marc.info/?l=linux-acpi&m=124840457126471&w=2
If yes, I'll pick them up again and rework a patch set, including some
good thoughts from Rafael.
thanks,
rui
> So for example, the sibling functions on a PCI card could all be
> assigned to the same group, but different cards could belong to
> different groups. Likewise for ATA and PCMCIA controllers. Extra
> cross-group constraints could be added if needed, but there should be
> relatively few of them.
>
> This way drivers can decide which of their devices will be resumed in
> sequence or concurrently, but they won't have to do any of the
> necessary work.
>
> Alan Stern
>
> On Sun, Dec 06, 2009 at 04:55:51PM -0800, Arjan van de Ven wrote:
> > On Sun, 6 Dec 2009 14:54:48 -0800
> > Dmitry Torokhov <dmitry....@gmail.com> wrote:
> >
> > > > isn't serio the PS/2 stuff?
> > >
> > > Yes, that's your PS/2 mouse (or rather touchpad), and the delay
> > > comes from device reset (needed by some keyboard controllers - I
> > > remember HP - or the device and keyboard will be dead at resume).
> >
> > and I have an HP laptop... so this makes perfect sense.
> > Thanks for the explanation!
> >
>
> Well, we do it for everyone; it's just a particular series of HPs that
> forced us to add it.
>
wonder if it should be a DMI based quirk instead...
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
> On Sun, Dec 06, 2009 at 05:18:56PM -0800, Arjan van de Ven wrote:
> > On Sun, 6 Dec 2009 14:54:48 -0800
> > Dmitry Torokhov <dmitry....@gmail.com> wrote:
> >
> > > Yes, that's your PS/2 mouse (or rather touchpad), and the delay
> > > comes from device reset (needed by some keyboard controllers - I
> > > remember HP - or the device and keyboard will be dead at resume).
> > >
> >
> > btw could we do this reset in an async function call (as long as we
> > wait for it to complete before we finally pull the plug)?
>
> It has to complete before we start shutting down i8042, so there are
> dependencies involved...
async function calls have two methods for synchronization:
* inside an async function, you can wait for all "earlier" async
functions to complete (async_synchronize_cookie)
* outside an async function, you can wait for all scheduled async
functions to complete (async_synchronize_full)
so there are two options to use the async code to cut down this time:
1) Make both the mouse, keyboard AND the i8042 suspend functions async,
and in the i8042 function the code first synchronizes on all previous
async work
2) only make the mouse and keyboard suspend async, and just wait for all
async work in i8042 suspend
I strongly prefer number 1, in terms of getting the best suspend speed.
It means that all other suspend code can run in parallel to the whole
serio/i8042 suspend.
Option two is simpler, but the delay is in the normal, synchronous path,
so other suspend code will not run in parallel.
The good news is that neither is hard for someone familiar with the
code...
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Sun, 6 Dec 2009, Alan Stern wrote:
>
> That's ridiculous. Having gone to all the trouble of building a device
> tree, one which is presumably still almost entirely correct, why go to
> all the trouble of tearing it down only to rebuild it again? (Note:
> I'm talking about resume-from-RAM here, not resume-from-hibernation.)
Hey, I can believe that it's worth keeping the USB device tree, and just
validating it instead. However:
> If I understand correctly, what you're suggesting is impractical. You
> would have each driver responsible for resuming the devices it
> registers.
The thing is, for 99% of all devices, we really _really_ don't care.
Especially PCI devices.
Your average laptop will have something like ten PCI devices on it, and
99% of those have no delays at all outside of the millisecond-level ones
that it takes for power management register writes etc to take place.
So what I'm suggesting is to NOT DO ANY ASYNC RESUME AT ALL by default.
Because async device management is _hard_, and results in various nasty
ordering problems that are timing-dependent etc. And it's totally
pointless for almost all cases.
This is why I think it is so crazy to try to create those idiotic "this
device depends on that other" lists etc - it's adding serious conceptual
complexity for something that nobody cares about, and that just allows for
non-deterministic behavior that we don't even want.
> So consider this suggestion: Let's define PM groups.
Let's not.
I can imagine that doing USB resume specially is worth it, since USB is
fundamentally a pretty slow bus. But USB is also a fairly clear hierarchy,
so there is no point in PM groups or any other information outside of the
pure topology.
But there is absolutely zero point in doing that for devices in general.
PCI drivers simply do not want concurrent initialization. The upsides are
basically zero (win a few msecs?) and the downsides are the pointless
complexity. We don't do PCI discovery asynchronously either - for all the
same reasons.
Now, a PCI driver may then implement a bus that is slow (ie SCSI, ATA,
USB), and that bus may itself then want to do something else. If it really
is a good idea to add the whole hierarchical model to USB suspend/resume,
I can live with that, but that is absolutely no excuse for then doing it
for cases where the hierarchy is (a) known to be broken (ie the whole PCI
multifunction thing, but also things like motherboard power management
devices) and (b) don't have the same kind of slow bus issues.
Linus
On Mon, 7 Dec 2009, Zhang Rui wrote:
>
> Hi, Linus,
> can you please look at this patch set and see if the idea is right?
> http://marc.info/?l=linux-kernel&m=124840449826386&w=2
> http://marc.info/?l=linux-acpi&m=124840456826456&w=2
> http://marc.info/?l=linux-acpi&m=124840456926459&w=2
> http://marc.info/?l=linux-acpi&m=124840457026468&w=2
> http://marc.info/?l=linux-acpi&m=124840457126471&w=2
So I'm not entirely sure about that patch-set, but the thing I like about
it is how drivers really sign up to it one by one, rather than having all
PCI devices automatically signed up for async behavior.
That said, the thing I don't like about it is some of the same thing I
don't necessarily like about the series in Rafael's tree either: it looks
rather over-designed with the whole infrastructure for async device logic
(your patch in http://marc.info/?l=linux-acpi&m=124840456926459&w=2). How
would you explain that whole async_dev_register() logic in simple terms to
somebody else?
(I think yours is simpler than the one in the PM tree, but I dunno. I've
not really compared the two).
So let me explain my dislike by trying to outline some conceptually simple
thing that doesn't have any call-backs, doesn't have any "classes",
doesn't require registration etc. It just allows drivers at any level to
decide to do some things (not necessarily everything) asynchronously.
Here's the outline:
- first off: drivers that don't know that they nest clearly don't do
anything asynchronous. No "PCI devices can be done in parallel" crap,
because they really can't - not in the general case. So just forget
about that kind of logic entirely: it's just wrong.
- the 'suspend' thing is a depth-first tree walk. As we suspend a node,
we first suspend the child nodes, and then we suspend the node itself.
Everybody agrees about that, right?
- Trivial "async rule": the tree is walked synchronously, but as we walk
it, any point in the tree may decide to do some or all of its suspend
asynchronously. For example, when we hit a disk node, the disk driver
may just decide that (a) it knows that the disk is an independent thing
and (b) it's hierarchical wrt its parent so (c) it can do the disk
suspend asynchronously.
- To protect against a parent node being suspended before any async child
work has completed, the child suspend - before it kicks off the actual
async work - just needs to take a read-lock on the parent (read-lock,
because you may have multiple children sharing a parent, and they don't
lock each other out). Then the only thing the asynchronous code needs
to do is to release the read lock when it is done.
- Now, the rule just becomes that the parent has to take a write lock on
itself when it suspends itself. That will automatically block until
all children are done.
Doesn't the above sound _simple_? Doesn't that sound like it should just
obviously do the right thing? It sounds like something you can explain as
a locking rule without having any complex semantic registration or
anything at all.
Now, the problem remains that when you walk the device tree starting off
all these potentially asynchronous events, you don't want to do that
serialization part (the "parent suspend") as you walk the tree - because
then you would only ever do one single level asynchronously. Which is why
I suggested splitting the suspend into a "pre-suspend" phase (and a
"post-resume" one). Because then the tree walk goes from
	# single depth-first thing
	suspend(root)
	{
		for_each_child(root) {
			// This may take the parent lock for
			// reading if it does something async
			suspend(child);
		}
		// This serializes with any async children
		write_lock(root->lock);
		suspend_one_node(root);
		write_unlock(root->lock);
	}

to

	# Phase one: walk the tree synchronously, starting any
	# async work on the leaves
	suspend_prepare(root)
	{
		for_each_child(root) {
			// This may take the parent lock for
			// reading if it does something async
			suspend_prepare(child);
		}
		suspend_prepare_one_node(root);
	}

	# Phase two: walk the tree synchronously, waiting for
	# and finishing the suspend
	suspend(root)
	{
		for_each_child(root) {
			suspend(child);
		}
		// This serializes with any async children started in phase 1
		write_lock(root->lock);
		suspend_one_node(root);
		write_unlock(root->lock);
	}
and I really think this should work.
The advantage: untouched drivers don't change ANY SEMANTICS AT ALL. If
they don't have a 'suspend_prepare()' function, then they still see that
exact same sequence of 'suspend()' calls. In fact, even if they have
children that _do_ have drivers that have that async phase, they'll never
know, because that simple write-semaphore trivially guarantees that
whether there was async work or not, it will be completed by the time we
call 'suspend()'.
And drivers that want to do things asynchronously don't need to register
or worry: all they do is literally
 - move their 'suspend()' function to 'suspend_prepare()' instead

 - add a

	down_read(dev->parent->lock);
	async_run(mysuspend, dev);

   to the point that they want to be asynchronous (which may be _all_ of
   it or just some slow part). The 'mysuspend' part would be the async
   part.

 - add a

	up_read(dev->parent->lock);

   to the end of their asynchronous 'mysuspend()' function, so that when
   the child has finished suspending, the parent down_write() will finally
   succeed.
Doesn't that all sound pretty simple? And it has a very clear architecture
that is pretty easy to explain to anybody who knows about traversing trees
depth-first.
No complex concepts. No change to existing tested drivers. No callbacks,
no flags, no nothing. And a pretty simple way for a driver to decide: I'll
do my suspends asynchronously (without parent drivers really ever even
having to know about it).
I dunno. Maybe I'm overlooking something, but the above is much closer to
what I think would be worth doing.
Linus
I have not received reports where it causes harm or reduces
functionality, so I'd prefer having it on by default rather than trying
to race with manufacturers.
--
Dmitry
And the bad thing is that it violates multiple layers in the kernel. The
atkbd driver does not have to be using i8042; neither does psmouse. Although
they do in 99% of the cases, there are other controllers providing the
i8042-style ports. Just grep for SERIO_8042 in drivers/input/serio.
I do not want to hard-code the i8042-psmouse-atkbd dependency.
--
Dmitry
On Sun, 6 Dec 2009, Linus Torvalds wrote:
>
> And drivers that want to do things asynchronously don't need to register
> or worry: all they do is literally [...]
Side note: for specific bus implementations, you obviously don't have to
even expose the choice. Things like the whole "suspend_late" and
"resume_early" phases don't make sense for USB devices, and the USB core
layer doesn't even expose those to the various USB drivers.
The same is true of the prepare_suspend/suspend split I'm proposing: I
suspect that for something like USB, it would make most sense to just do
normal node suspend in prepare_suspend, which would do everything
asynchronously. Only USB hub devices would get involved at the later
'suspend()' phase.
So I'm not suggesting that "all drivers" would necessarily even need
changing in order to take advantage of asynchronous behavior.
You could change just the _core_ USB layer, and it would do everything
automatically for USB devices: USB devices would automatically
suspend asynchronously not because the generic device layer knows about
it, but because the USB bus layer chose to do that "async_run()" on the
leaf node suspend functions (or rather: a helper function that calls the
leaf-node suspend, and then does the 'up_read()' call on the parent
lock; the actual USB drivers would never know about any of this).
> And the bad thing is that it violates multiple layers in the kernel.
> The atkbd driver does not have to be using i8042; neither does psmouse.
> Although they do in 99% of the cases, there are other controllers
> providing the i8042-style ports. Just grep for SERIO_8042 in
> drivers/input/serio.
>
> I do not want to hard-code the i8042-psmouse-atkbd dependency.
it's not a specific dependency.
it's a "I know I'm critical, so everything before me needs to be done".
that doesn't encode an actual relationship, it encodes a potential
relationship... with a worst case behavior of ... what we do right
now ;-)
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
>
> Now, the problem remains that when you walk the device tree starting
> off all these potentially asynchronous events, you don't want to do
> that serialization part (the "parent suspend") as you walk the tree -
> because then you would only ever do one single level asynchronously.
> Which is why I suggested splitting the suspend into a "pre-suspend"
> phase (and a "post-resume" one). Because then the tree walk goes from
> I dunno. Maybe I'm overlooking something, but the above is much
> closer to what I think would be worth doing.
with what you're describing I suspect the current async function calls
could be used;
in the first tree walk, the drivers do an async_schedule() of the
things they want done asynchronous;
all the core then needs to do is a full synchronization step between the
two tree walks... and we get pretty much all the benefits without
needing the read-then-write-lock primitive for synchronization.
alternative would be to do the synchronization in the part where we
know there's a dependency (like your lock is doing);
but instead of a lock we could store the async cookie there; and just
wait on that in the 2nd phase.... this would be more fine-grained, and
an optimization from the "global synchronize"... but I'm not sure it'll
be worth it in practice; it will if there's significant cost in various
parts of the tree AND in the 2nd run; if the 2nd run is cheap in
general, you're not going to get real extra parallelism at the price of
more complexity.
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
This is the case with every parent device, isn't it? It is important for
its children. And wasn't Rafael's patchset trying to address exactly
this?
--
Dmitry
Just for the record, it's not in there any more.
I don't think the idea is really that much simpler than the one behind the
patchset you've just rejected. The only real difference is that in that
patchset the entire suspend and resume callbacks could be either
asynchronous or synchronous and in your approach each callback may
be divided into a synchronous and an asynchronous part, which admittedly is
more flexible, but not necessarily simpler.
Now, apart from the idea there are some details that need to be taken into
consideration like the fact that the children may not be the only devices
you need to wait for with the parent suspend and that implies additional
locking rules. But you need to know which devices to lock and that has
to be represented somehow (the PM links in my patchset were for this and
_nothing_ else).
Also, it looks like the parent locking should rather be done at the core
level, as it appears to be a piece of code that needs to be called for each
device:
if (I_have_children || I_have_other_dependent_devices)
write_lock_myself
We already have prepare and complete suspend callbacks, for a different reason,
and I'm not sure they're suitable for doing the async thing.
So, we'd need to add another two callbacks, just for suspend to RAM, and what
about hibernation? Isn't that going to become a bit too complicated?
> The advantage: untouched drivers don't change ANY SEMANTICS AT ALL.
This also was true for my patchset.
> If they don't have a 'suspend_prepare()' function, then they still see that
> exact same sequence of 'suspend()' calls.
The same held for drivers without the async_suspend flag set in my patchset
(I really should have left setting it to individual drivers).
> In fact, even if they have children that _do_ have drivers that have that
> async phase, they'll never know, because that simple write-semaphore
> trivially guarantees that whether there was async work or not, it will be
> completed by the time we call 'suspend()'.
Ditto.
> And drivers that want to do things asynchronously don't need to register
> or worry: all they do is literally
>
> - move their 'suspend()' function to 'suspend_prepare()' instead
>
> - add a
>
> down_read(dev->parent->lock);
> async_run(mysuspend, dev);
>
> to the point that they want to be asynchronous (which may be _all_ of
> it or just some slow part). The 'mysuspend' part would be the async
> part.
>
> - add a
>
> up_read(dev->parent->lock);
>
> to the end of their asynchronous 'mysuspend()' function, so that when
> the child has finished suspending, the parent down_write() will finally
> succeed.
In my patchset the drivers didn't need to do all that stuff. The only thing
they needed, if they wanted their suspend/resume to be executed
asynchronously, was to set the async_suspend flag.
But this is just for the record, in case you end up with code that's more
complicated than the rejected one.
Rafael
> # Phase one: walk the tree synchronously, starting any
> # async work on the leaves
> suspend_prepare(root)
> {
> for_each_child(root) {
> // This may take the parent lock for
> // reading if it does something async
> suspend_prepare(child);
> }
> suspend_prepare_one_node(root);
> }
>
> # Phase two: walk the tree synchronously, waiting for
> # and finishing the suspend
> suspend(root)
> {
> for_each_child(root) {
> suspend(child);
> }
> // This serializes with any async children started in phase 1
> write_lock(root->lock);
> suspend_one_node(root);
> write_unlock(root->lock);
> }
>
> and I really think this should work.
> No complex concepts. No change to existing tested drivers. No callbacks,
> no flags, no nothing. And a pretty simple way for a driver to decide: I'll
> do my suspends asynchronously (without parent drivers really ever even
> having to know about it).
>
> I dunno. Maybe I'm overlooking something, but the above is much closer to
> what I think would be worth doing.
You're overlooking resume. It's more difficult than suspend. The
issue is that a child can't start its async part until the parent's
synchronous part is finished.
So for example, suppose the device listing contains P, C, Q, where C is
a child of P, Q is unrelated, and P has a long-lasting asynchronous
requirement. The resume process will stall upon reaching C, waiting
for P to finish. Thus even though P and Q might be able to resume in
parallel, they won't get the chance.
An approach that handles resume well can probably be adapted to handle
suspend too. The reverse isn't true, as this example shows.
Alan Stern
Yes, it was.
Thanks,
Rafael
> I can imagine that doing USB resume specially is worth it, since USB is
> fundamentally a pretty slow bus. But USB is also a fairly clear hierarchy,
> so there is no point in PM groups or any other information outside of the
> pure topology.
>
> But there is absolutely zero point in doing that for devices in general.
> PCI drivers simply do not want concurrent initialization. The upsides are
> basically zero (win a few msecs?) and the downsides are the pointless
> complexity. We don't do PCI discovery asynchronously either - for all the
> same reasons.
>
> Now, a PCI driver may then implement a bus that is slow (ie SCSI, ATA,
> USB), and that bus may itself then want to do something else. If it really
> is a good idea to add the whole hierarchical model to USB suspend/resume,
> I can live with that, but that is absolutely no excuse for then doing it
> for cases where the hierarchy is (a) known to be broken (ie the whole PCI
> multifunction thing, but also things like motherboard power management
> devices) and (b) don't have the same kind of slow bus issues.
Okay. I can understand not wanting to burden everybody else with USB's
weaknesses.
Simply doing an async suspend & resume of each USB root hub might be
enough to give a significant advantage. For the most part these root
hubs tend to be registered sequentially with few or no other devices in
between.[*] Hence the "stalls" that would occur when suspending a
parent or resuming a child wouldn't slow things down very much. We
would not always reap the maximum advantage of a fully-asynchronous
approach but there would be some improvement.
This is sort of what Arjan suggested yesterday. Its benefit is that
nothing outside usbcore has to change.
Alan Stern
[*] In fact this is true only on systems where the USB host controller
drivers are built as modules. If everything is compiled into the
kernel then the devices are registered in the worst possible order:
controller 1, root hub 1, controller 2, root hub 2, ...
I suppose the root hubs could be registered in a delayed work routine.
It would be a little awkward but it would solve this issue.
On Mon, 7 Dec 2009, Rafael J. Wysocki wrote:
>
> > The advantage: untouched drivers don't change ANY SEMANTICS AT ALL.
>
> This also was true for my patchset.
That's simply not true.
You set async_suspend on every single PCI driver. I object very heavily to
it.
You also introduce this whole big "callback when ready", and
> "non-topological PM dependency chain" thing. Which I also object to.
Notice how with the simpler "lock parent" model, you _can_ actually encode
non-topological dependencies, but you do it by simply read-locking
whatever other independent device you want. So if an architecture has some
system devices that have odd rules, that architecture can simply encode
those rules in its suspend() functions.
It doesn't need to expose it to the device layer - because the device
layer won't even care. The code will just automatically "do the right
thing" without even having that notion of PM dependencies at any other
level than the driver that knows about it.
No registration, no callbacks, no nothing.
> In my patchset the drivers didn't need to do all that stuff. The only thing
> they needed, if they wanted their suspend/resume to be executed
> asynchronously, was to set the async_suspend flag.
In my patchset, the drivers don't need to either.
The _only_ thing that would do this is something like the USB layer. We're
talking ten lines of code or so. And you get rid of all the PM
dependencies and all the infrastructure - because the model is so simple
that it doesn't need any.
(Well, except for the infrastructure to run things asynchronously, but
that was kind of my point from the very beginning: we can just re-use all
that existing async infrastructure. We already have that).
Linus
On Mon, 7 Dec 2009, Alan Stern wrote:
> >
> > I dunno. Maybe I'm overlooking something, but the above is much closer to
> > what I think would be worth doing.
>
> You're overlooking resume. It's more difficult than suspend. The
> issue is that a child can't start its async part until the parent's
> synchronous part is finished.
No, I haven't overlooked resume at all. I just assumed that it was
obvious. It's the exact same thing, except in reverse (the locking ends
up being slightly different, but the changes are actually fairly
straightforward).
And by reverse, I mean that you walk the tree in the reverse order too,
exactly like we already do - on suspend we walk it children-first, on
resume we walk it parents-first (small detail: we actually just walk a
simple linked list, but the list is topologically ordered, so walking it
forwards/backwards is topologically the same thing as doing that
depth-first search).
> So for example, suppose the device listing contains P, C, Q, where C is
> a child of P, Q is unrelated, and P has a long-lasting asynchronous
> requirement. The resume process will stall upon reaching C, waiting
> for P to finish. Thus even though P and Q might be able to resume in
> parallel, they won't get the chance.
No. The resume process does EXACTLY THE SAME THING as I outlined for
suspend, but just all in reverse. So now the resume process becomes the
same two-phase thing:
# Phase one
resume(root)
{
// This can do things asynchronously if it wants,
// but needs to take the write lock on itself until
// it is done if it does
resume_one_node(root);
for_each_child(root)
resume(child);
}
# Phase two
post_resume(root)
{
post_resume_one_node(root);
for_each_child(root)
post_resume(child);
}
Notice? It's _exactly_ the same thing as suspend - except all turned
around. We do the nodes before the children ("walk the list backwards"),
and we also do the locking the other way around (ie on suspend we'd lock
the _parent_ if we wanted to do async stuff - to keep it around - but on
resume we lock _ourselves_, so that the children can have something to
wait on. Also note how we take a _write_ lock rather than a read lock).
(And again, I've only written it out in email, I've not tested it or
thought about it all that deeply, so you'll excuse any stupid thinkos.)
Now, for something like PCI, I'd suggest (once more) leaving all drivers
totally unchanged, and you end up with the exact same behavior as we had
before (no real change to the whole resume ordering, and everything is
synchronous so there is no relevant locking).
But how would the USB layer do this?
Simple: all the normal leaf devices would have their resume callback be
called at "post_resume()" time (exactly the reverse of the suspend phase:
we suspend early, and we resume late - it's all a mirror image). And I'd
suggest that the USB layer do it all totally asynchronously, except again
turned around the other way.
Remember how on suspend, the suspend of a leaf device ended up being an
issue of asynchronously calling a function that did the suspend, and then
released the read-lock of the parent. Resume is the same, except now we'd
actually want to take the parent read-lock asynchronously too, so you'd do
down_write(leaf->lock);
async_schedule(usb_node_resume, leaf);
where that function simply does
usb_node_resume(node)
{
/* Wait for the parent to have resumed completely */
down_read(node->parent->lock);
node->resume(node)
up_read(node->parent->lock);
up_write(node->lock);
}
and you're all done. Once more the ordering and the locking takes care of
any need to serialize - there is no data structures to keep track of.
And what about USB hubs? They get resumed in the first phase (again,
exactly the mirror image of the suspend), and the only thing they need to
do is that _exact_ same thing above:
down_write(hub->lock);
async_schedule(usb_node_resume, hub);
- Ta-daa! All done.
Notice? It's really pretty straightforward, and there are _zero_ new
concepts. And again, no callbacks, no nothing. Just the obvious mirror
image of what happened when suspending. We do everything with simple async
calls. And none of the tree walking actually blocks (yes, we do a
"down_write()" on the nodes as we schedule the resume code, but it won't
be a blocking one, since that is the first time we encounter that node:
the blocking will come later when the async threads actually need to wait
for things).
Again, I do not guarantee that I've dotted every i, and crossed every t.
It's just that I'm pretty sure that we really don't need any fancy
"infrastructure" for something this simple. And I really much prefer
"conceptually simple high-level model" over a model of "keep track of all
the relationships and have some complex model of devices".
So let's just look at your example:
> So for example, suppose the device listing contains P, C, Q, where C is
> a child of P, Q is unrelated, and P has a long-lasting asynchronous
> requirement.
The tree is:
... -> P -> C
-> Q
and with what I suggest, during phase one, P will asynchronously start the
resume. As part of its async resume it will have to wait for its parents,
of course, but all of that happens in a separate context, and the tree
traversal goes on.
And during phase #1, C and Q won't do anything at all. We _could_ do them
during this phase, and it would actually all work out fine, but we
wouldn't want to do that for a simple reason: we _want_ the pre_suspend
and post_resume phases to be total mirror images, because if we end up
doing error handling for the pre-suspend case, then the post-resume phase
would be the "fixup" for it, so we actually want leaf things to happen
during phase #2 - not because it would screw up locking or ordering, but
because of other issues.
When we hit phase #2, we then do C and Q, and do the same thing - we have
an async call that does the read-lock on the parent to make sure it's
all resumed, and then we resume C and Q. And they'll automatically resume
in parallel (unless C is waiting for P, of course, in which case P and Q
end up resuming in parallel, and C ends up waiting).
Now, the above just takes care of the inter-device ordering. There are
unrelated semantics we want to give, like "all devices will have resumed
before we start waking up user space". Those are unrelated to the
topological requirements, of course, and are not a requirement imposed by
the device tree, but by our _other_ semantics (IOW, in this respect it's
kind of like how we wanted pre-suspend and post-resume to be mirror images
for other outside reasons).
So we'd actually have a "phase #3", but that phase wouldn't be visible to
the devices themselves, it would be a
	# Phase three: make sure everything is resumed
for_each_device() {
read_lock(dev->lock);
read_unlock(dev->lock);
}
but as you can see, there's no actual device callbacks involved. It would
be just the core device layer saying "ok, now I'm going to wait for all
the devices to have finished their resume".
Linus
On Mon, 7 Dec 2009, Linus Torvalds wrote:
>
> And during phase #1, C and Q won't do anything at all. We _could_ do them
> during this phase, and it would actually all work out fine, but we
> wouldn't want to do that for a simple reason: we _want_ the pre_suspend
> and post_resume phases to be total mirror images, because if we end up
> doing error handling for the pre-suspend case, then the post-resume phase
> would be the "fixup" for it, so we actually want leaf things to happen
> during phase #2 - not because it would screw up locking or ordering, but
> because of other issues.
Ho humm.
This part made me think. Since I started mulling over the fact that we
could do the resume thing in a single phase (and really only wanted the
second phase in order to be a mirror image to the suspend), I started
thinking that we could perhaps do even the suspend with a single phase,
and avoid introducing that pre-suspend/post-resume phase at all.
And now that I think about it, we can do that by simply changing the
locking just a tiny bit.
I originally envisioned that two-phase suspend because I was thinking that
the first phase would start off the suspend, and the second phase would
finish it, but we can actually do it all with a single phase that does
both. So starting with just the regular depth-first post-ordering that is
a suspend:
suspend(root)
{
for_each_child(root)
suspend(child);
suspend_one_node(root)
}
the rule would be that for something like USB that wants to do the suspend
asynchronously, the node suspend routine would do
usb_node_suspend(node)
{
// Make sure parent doesn't suspend: this will not block,
// because we'll call the 'suspend' function for all nodes
// before we call it for the parent.
down_read(node->parent->lock);
// Do the part that may block asynchronously
async_schedule(do_usb_node_suspend, node);
}
do_usb_node_suspend(node)
{
	// Start our suspend. This will block if we have any
// children that are still busy suspending (they will
// have done a down_read() in their suspend).
down_write(node->lock);
node->suspend(node);
up_write(node->lock);
// This lets our parent continue
up_read(node->parent->lock);
}
and it looks like we don't even need a second phase at all.
IOW, I think USB could do this on its own right now, with no extra
infrastructure from the device layer AT ALL, except for one small thing:
that new "rwsem" lock in the device data structure, and then we'd need the
"wait for everybody to have completed" loop, ie
for_each_dev(dev) {
down_write(dev->lock);
up_write(dev->lock);
}
thing at the end of the suspend loop (same thing as I mentioned about
resuming).
So I think even that whole two-phase thing was unnecessarily complicated.
> No, I haven't overlooked resume at all. I just assumed that it was
> obvious. It's the exact same thing, except in reverse (the locking ends
> up being slightly different, but the changes are actually fairly
> straightforward).
>
> And by reverse, I mean that you walk the tree in the reverse order too,
> exactly like we already do - on suspend we walk it children-first, on
> resume we walk it parents-first (small detail: we actually just walk a
> simple linked list, but the list is topologically ordered, so walking it
> forwards/backwards is topologically the same thing as doing that
> depth-first search).
> Notice? It's _exactly_ the same thing as suspend - except all turned
> around. We do the nodes before the children ("walk the list backwards"),
> and we also do the locking the other way around (ie on suspend we'd lock
> the _parent_ if we wanted to do async stuff - to keep it around - but on
> resume we lock _ourselves_, so that the children can have something to
> wait on. Also note how we take a _write_ lock rather than a read lock).
Okay, I think I've got it. But you're wrong about one thing: Resume
isn't _exactly_ the reverse of suspend. For both of them we have to
start the async thread in the first pass. So instead of
resume/post_resume we would have pre_resume/resume, just like
pre_suspend/suspend.
During the pre- pass, the driver launches an async thread and takes the
appropriate locks. The thread does its work as appropriate (with
locking to insure that it first waits for children or parents), and
then in the second pass the driver waits for the async thread to
finish.
A non-async driver (i.e., most of them) would ignore the pre- pass
entirely and do all its work in the second pass.
An async-aware driver would look like this:
pre_suspend(dev)
{
/* Prevent parent from suspending until we are ready */
down_read(dev->parent->lock);
dev->pm_cookie = async_schedule(async_suspend, dev);
}
async_suspend(dev)
{
/* Wait until all children are fully suspended */
down_write(dev->lock);
Suspend dev, taking as much time as needed
up_write(dev->lock);
/* Allow parent to suspend */
up_read(dev->parent->lock);
}
suspend(dev)
{
/* Wait until the suspend is complete */
async_synchronize_cookie(dev->pm_cookie);
}
pre_resume(dev)
{
/* Prevent children from resuming */
down_write(dev->lock);
dev->pm_cookie = async_schedule(async_resume, dev);
}
async_resume(dev)
{
/* Wait until parent is fully resumed */
down_read(dev->parent->lock);
Resume dev, taking as much time as needed
up_read(dev->parent->lock);
/* Allow children to resume */
up_write(dev->lock);
}
resume(dev)
{
/* Wait until resume is complete */
async_synchronize_cookie(dev->pm_cookie);
}
So there's some time symmetry here, but it isn't perfect. This is
probably what you had in mind all along, but I needed to get it
straight.
There's some question about what to do if a suspend or resume fails. A
bunch of async threads will have been launched for other devices, but
now there won't be anything waiting for them. It's not clear how this
should be handled.
Alan Stern
On Mon, 7 Dec 2009, Alan Stern wrote:
>
> Okay, I think I've got it. But you're wrong about one thing: Resume
> isn't _exactly_ the reverse of suspend.
Yeah, no. But I think I made it much closer by getting rid of pre-suspend
and post-resume (my next email to the one you quoted).
And yeah, I started thinking along those lines exactly because it wasn't
as clean a mirror image as I thought it should be able to be.
> A non-async driver (i.e., most of them) would ignore the pre- pass
> entirely and do all its work in the second pass.
See my second email, where I think I can get rid of the whole second pass
thing. I think you'll agree that it's an even nicer mirror image.
> There's some question about what to do if a suspend or resume fails. A
> bunch of async threads will have been launched for other devices, but
> now there won't be anything to wait for them. It's not clear how this
> should be handled.
I think the rule for "suspend fails" is very simple: you can't fail in the
async codepath. There's no sane way to return errors, and trying to would
be too complex anyway. What would you do?
In fact, even though we _can_ fail in the synchronous path, I personally
consider a device driver that ever fails its suspend to be terminally
broken. We're practically always better off suspending and simply turning
off the power than saying "uh, I failed the suspend".
I've occasionally hit a few drivers that caused suspend failures, and each
and every time it was a driver bug, and the right thing to do was to just
ignore the error and suspend anyway - returning an error code and trying
to undo the suspend is not what anybody ever really wants, even if our
model _allows_ for it.
(And the rule for "resume fails" is even simpler: there's nothing we can
really do if something fails to resume - and that's true whether the
failure is synchronous or asynchronous. The device is dead. Try to reset
it, or remove it from the device tree. Tough).
Linus
> See my second email, where I think I can get rid of the whole second pass
> thing. I think you'll agree that it's an even nicer mirror image.
Yes, I like this approach better and better.
There is still a problem. In your code outlines, you have presented a
classic depth-first (suspend) or depth-last (resume) tree algorithm.
But that's not how the PM core works. Instead it maintains dpm_list, a
list of all devices in order of registration. Suspends and resumes are
carried out by iterating along this list, in the reverse and forward
directions respectively.
There are two advantages. The matter of stack usage, of course. But
more importantly, this order of devices is guaranteed to work. For any
device D, we _know_ that the system can function properly in
circumstances where everything on dpm_list before D is active and
everything after D is inactive -- because that's the state the system
was in when D was registered. Any other order risks errors because of
unknown dependencies.
The consequence is that there's no way to hand off an entire subtree to
an async thread. And as a result, your single-pass algorithm runs into
the kind of "stall" problem I described before.
(In theory we could convert over to a tree algorithm. IMO that would
be nearly as dangerous as going to a full-fledged totally async
scheme.)
But all is not lost. We can still get what we want using a two-pass
list algorithm, where one of the passes is contained within the PM core
-- no extra callbacks are needed. Here's how suspend would work:
dpm_suspend() /* Suspend all devices on dpm_list */
{
list_for_each_entry_reverse(dev, dpm_list, ...) {
/* Make the parent wait for dev */
down_read(dev->parent->lock);
if (dev->async_pm)
async_schedule(device_suspend, dev);
}
list_for_each_entry_reverse(dev, dpm_list, ...) {
if (!dev->async_pm)
device_suspend(dev);
}
async_synchronize_full();
}
device_suspend(dev) /* Suspend a single device */
{
/* Wait until all the children are suspended */
down_write(dev->lock);
dev->bus->suspend(dev);
up_write(dev->lock);
/* Tell the parent we are finished */
up_read(dev->parent->lock);
}
I have glossed over a bunch of details, such as the fact that
device_suspend() really takes two arguments. And it's necessary to be
more careful with the list operations than shown here, because devices
can be unregistered while all this is going on.
Still, this seems reasonable. Bus subsystems and drivers can set the
dev->async_pm flag as desired, and they can use the new rwsems to
handle special dependencies without involving the PM core. No new
callbacks are needed, nor any changes to existing methods.
(Convincing lockdep that all this fancy footwork is valid may require
some effort, though.)
By the way, this bears a striking resemblance to Rafael's patch. The
biggest difference is the use of the new rwsem for dependency
resolution, instead of his somewhat cumbersome constraint structures.
> > There's some question about what to do if a suspend or resume fails. A
> > bunch of async threads will have been launched for other devices, but
> > now there won't be anything waiting for them. It's not clear how this
> > should be handled.
>
> I think the rule for "suspend fails" is very simple: you can't fail in the
> async codepath. There's no sane way to return errors, and trying to would
> be too complex anyway. What would you do?
You could prevent the suspend procedure from going any further and
abort the entire system sleep. If you wanted to.
> In fact, even though we _can_ fail in the synchronous path, I personally
> consider a device driver that ever fails its suspend to be terminally
> broken. We're practically always better off suspending and simply turning
> off the power than saying "uh, I failed the suspend".
>
> I've occasionally hit a few drivers that caused suspend failures, and each
> and every time it was a driver bug, and the right thing to do was to just
> ignore the error and suspend anyway - returning an error code and trying
> to undo the suspend is not what anybody ever really wants, even if our
> model _allows_ for it.
There is a valid reason for aborting a sleep transition: the driver has
received a remote wakeup request. Wakeup requests race with sleep, of
course. A request coming after the system is asleep will wake it up;
one coming before the system is asleep should either cause it to wake
up immediately after shutting down or prevent the sleep entirely.
Causing the system to wake up immediately needs hardware support. But
by the time the kernel is aware of a wakeup request, the request is
generally no longer present in the hardware. (For example, an
interrupt has been delivered and the IRQ line is no longer active.)
So the only remaining choice is to abort the sleep transition.
> (And the rule for "resume fails" is even simpler: there's nothing we can
> really do if something fails to resume - and that's true whether the
> failure is synchronous or asynchronous. The device is dead. Try to reset
> it, or remove it from the device tree. Tough).
Right.
Alan Stern
That was a mistake, I admit.
However, it was done in a separate patch that (1) was not necessary and (2)
shouldn't have been there. Sorry for making the mistake of including that into
the patchset. So I understand your objection to that and let's not get back to
this again, ok?
> You also introduce this whole big "callback when ready", and
> "non-topological PM dependency chain" thing. Which I also object to.
These things are also non-essential. Actually, they weren't there in the initial
version of my patches and were added after people had complained that it hadn't
been parallel enough and hadn't taken the off-tree dependencies into account.
I could remove these things too, and quite easily.
> Notice how with the simpler "lock parent" model, you _can_ actually encode
> non-topological dependencies, but you do it by simply read-locking
> whatever other independent device you want. So if an architecture has some
> system devices that have odd rules, that architecture can simply encode
> those rules in its suspend() functions.
I'm not arguing against that. In fact, my only worry was the additional
suspend/resume callbacks I really wouldn't have liked to introduce. But since you've
found a way of doing things without them, I'm totally fine with this approach.
> It doesn't need to expose it to the device layer - because the device
> layer won't even care. The code will just automatically "do the right
> thing" without even having that notion of PM dependencies at any other
> level than the driver that knows about it.
>
> No registration, no callbacks, no nothing.
>
> > In my patchset the drivers didn't need to do all that stuff. The only thing
> > they needed, if they wanted their suspend/resume to be executed
> > asynchronously, was to set the async_suspend flag.
>
> In my patchset, the drivers don't need to either.
>
> The _only_ thing that would do this is something like the USB layer. We're
> talking ten lines of code or so. And you get rid of all the PM
> dependencies and all the infrastructure - because the model is so simple
> that it doesn't need any.
It just uses a different way of representing these things, perhaps more
efficiently.
> (Well, except for the infrastructure to run things asynchronously, but
> that was kind of my point from the very beginning: we can just re-use all
> that existing async infrastructure. We already have that).
So I guess the only thing we need at the core level is to call
async_synchronize_full() after every stage of suspend/resume, right?
Rafael
On Mon, 7 Dec 2009, Alan Stern wrote:
>
> Yes, I like this approach better and better.
>
> There is still a problem. In your code outlines, you have presented a
> classic depth-first (suspend) or depth-last (resume) tree algorithm.
Yes, I did that because it clarifies the locking rules (ie "we traverse
parent nodes last/first"), not because it was actually relevant to
anything else.
And the whole pre-order vs post-order distinction is important, but it really
only shows up when you show the pseudo-code as a tree walk.
> But that's not how the PM core works. Instead it maintains dpm_list, a
> list of all devices in order of registration.
Right. I did mention that in a couple of the asides, I'm well aware that
we don't actually traverse the tree as a tree.
But the "traverse list forward" is logically the same thing as doing
a pre-order DFS, while going backwards is equivalent to doing a post-order
DFS, since all we really care about is the whole "parent first" or
"children first" part of the ordering.
So I wanted to show the logic in pseudo-code using the tree walk (because
that explains the logic _conceptually_ much better), but the actual code
would just do the list traversal.
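[The list-order-vs-tree-order equivalence can be checked with a tiny
userspace sketch -- hypothetical names, not kernel code: because a child is
always registered after its parent, a forward walk of the registration list
is parent-first (pre-order), and a reverse walk is children-first.]

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: devices kept in registration order, as in dpm_list.
 * A child is always registered after its parent, so a forward walk
 * sees parents first; walking backwards sees children first. */
struct toy_dev {
	struct toy_dev *parent;
};

/* Return 1 if every device appears after its parent in the list. */
static int parents_first(struct toy_dev **list, int n)
{
	for (int i = 0; i < n; i++) {
		int seen = 0;

		if (!list[i]->parent)
			continue;
		for (int j = 0; j < i; j++)
			if (list[j] == list[i]->parent)
				seen = 1;
		if (!seen)
			return 0;	/* parent registered later: impossible */
	}
	return 1;
}

int list_order_demo(void)
{
	struct toy_dev root = { NULL };
	struct toy_dev bridge = { &root };
	struct toy_dev leaf = { &bridge };
	struct toy_dev *dpm_list[] = { &root, &bridge, &leaf };

	return parents_first(dpm_list, 3);
}
```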
> The consequence is that there's no way to hand off an entire subtree to
> an async thread. And as a result, your single-pass algorithm runs into
> the kind of "stall" problem I described before.
No, look again. There's no stall in the thing, because all it really
depends on (for the suspend path) is that it sees all children before
the parent (because the child will do a "down_read()" on the parent node
and that should not stall), and for the resume path it depends on seeing
the parent node before any children (because the parent node does that
"down_write()" on its own node).
Everything else is _entirely_ asynchronous, including all the other locks
it takes. So there are no stalls (except, of course, if we then hit limits
on numbers of outstanding async work and refuse to create too many
outstanding async things, but that's a separate issue, and intentional, of
course).
You're right that my first one (two-phase suspend) had a stall situation.
Linus
On Mon, 7 Dec 2009, Rafael J. Wysocki wrote:
>
> So I guess the only thing we need at the core level is to call
> async_synchronize_full() after every stage of suspend/resume, right?
Yes and no.
Yes in the sense that _if_ everybody always uses "async_schedule()" (or
whatever the call is named - I've really only written pseudo-code and
haven't even tried to look up the details), then the only thing you need
to do is async_synchronize_full().
But one of the nice things about using just the trivial rwlock model and
letting any async users just depend on that is that we could easily just
depend entirely on those device locks, and allow drivers to do async
shutdowns other ways too.
For example, I could imagine some driver just doing an async suspend (or
resume) that gets completed in an interrupt context, rather than being
done by 'async_schedule()' at all.
So in many ways it's nicer to serialize by just doing
serialize_all_PM_events()
{
for_each_device() {
down_write(dev->lock);
up_write(dev->lock);
}
}
rather than depend on something like async_synchronize_full() that
obviously waits for all async events, but doesn't have the capability to
wait for any other event that some random driver might be using.
[ That "down+up" is kind of stupid, but I don't think we have a "wait for
unlocked" rwsem operation. We could add one, and it would be cheaper for
the case where the device never did anything async at all, and didn't
really need to dirty that cacheline by doing that write lock/unlock
pair. ]
But that really isn't a big deal. I think it would be perfectly ok to also
just say "if you do any async PM, you need to use 'async_schedule()'
because that's all we're going to wait for". It's probably perfectly fine.
Linus
> > The consequence is that there's no way to hand off an entire subtree to
> > an async thread. And as a result, your single-pass algorithm runs into
> > the kind of "stall" problem I described before.
>
> No, look again. There's no stall in the thing, because all it really
> depends on (for the suspend path) is that it sees all children before
> the parent (because the child will do a "down_read()" on the parent node
> and that should not stall), and for the resume path it depends on seeing
> the parent node before any children (because the parent node does that
> "down_write()" on its own node).
>
> Everything else is _entirely_ asynchronous, including all the other locks
> it takes. So there are no stalls (except, of course, if we then hit limits
> on numbers of outstanding async work and refuse to create too many
> outstanding async things, but that's a separate issue, and intentional, of
> course).
It only seems that way because you didn't take into account devices
that suspend synchronously but whose children suspend asynchronously.
A synchronous suspend routine for a device with async child suspends
would have to look just like your usb_node_suspend():
suspend_one_node(dev)
{
/* Wait until the children are suspended */
down_write(dev->lock);
Suspend dev
up_write(dev->lock);
/* Allow the parent to suspend */
up_read(dev->parent->lock);
}
So now suppose we've got two USB host controllers, A and B. They are
PCI devices, so they suspend synchronously. Each has a root hub child
(P and Q respectively) which is a USB device and therefore suspends
asynchronously. dpm_list contains: A, P, B, Q. (In fact A doesn't
enter into this discussion; you can ignore it.)
In your one-pass algorithm, we start with usb_node_suspend(Q). It does
down_read(B->lock) and starts an async task for Q. Then we move on to
suspend_one_node(B). It does down_write(B->lock) and blocks until the
async task finishes; then it suspends B. Finally we move on to
usb_node_suspend(P), which does down_read(A->lock) and starts an async
task for P.
The upshot is that P is stuck waiting for Q to suspend, even though it
should have been able to suspend in parallel. This is simply because P
precedes B in the list, and B is synchronous and must wait for Q to
finish.
With my two-pass algorithm, we start with Q. The first loop does
down_read(B->lock) and starts an async task for Q. We move on to B and
do down_read(B->parent->lock), nothing more. Then we move on to P,
with down_read(A->lock) and start an async task for P. Finally we do
down_read(A->parent->lock). Notice that now there are two async tasks,
for P and Q, running in parallel.
The second pass waits for Q to finish before suspending B
synchronously, and waits for P to finish before suspending A
synchronously. This is unavoidable. The point is that it allows P and
Q to suspend at the same time, not one after the other as in the
one-pass scheme.
Alan Stern
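[Alan's four-device scenario reduces to simple arithmetic. In this toy model
(unit-time suspends, names as in the example above; purely illustrative), the
one-pass walk needs four ticks because P's async task is only launched after
B's synchronous suspend returns, while the two-pass walk overlaps P and Q and
finishes in three.]

```c
/* Toy timeline for the A, P, B, Q example: every suspend takes 1 tick.
 * dpm_list is A, P, B, Q, so the suspend loop visits Q, B, P, A. */

static int later(int a, int b) { return a > b ? a : b; }

/* One pass: B suspends synchronously inline, so P is not even
 * scheduled until B (and therefore Q) has finished. */
int one_pass_ticks(void)
{
	int q_end = 0 + 1;	/* Q launched async at t = 0 */
	int b_end = q_end + 1;	/* B waits for Q, runs inline */
	int p_end = b_end + 1;	/* P only launched after B returns */
	int a_end = p_end + 1;	/* A waits for P */

	return a_end;
}

/* Two passes: pass 1 launches Q and P in parallel at t = 0;
 * pass 2 suspends B after Q, then A after both B and P. */
int two_pass_ticks(void)
{
	int q_end = 1, p_end = 1;	/* async, overlapped */
	int b_end = q_end + 1;
	int a_end = later(b_end, p_end) + 1;

	return a_end;
}
```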
On Mon, 7 Dec 2009, Alan Stern wrote:
>
> It only seems that way because you didn't take into account devices
> that suspend synchronously but whose children suspend asynchronously.
But why would I care? If somebody suspends synchronously, then that's what
he wants.
> A synchronous suspend routine for a device with async child suspends
> would have to look just like your usb_node_suspend():
Sure. But that sounds like a "Doctor, it hurts when I do this" situation.
Don't do that.
Make the USB host controller do its suspend asynchronously. We don't
suspend PCI bridges anyway, iirc (but I didn't actually check). And at
worst, we can make the PCI _bridges_ know about async suspends, and solve
it that way - without actually making any normal PCI drivers do it.
Linus
That's correct, we don't.
Rafael
On Mon, 7 Dec 2009, Alan Stern wrote:
> >
> > Make the USB host controller do its suspend asynchronously. We don't
> > suspend PCI bridges anyway, iirc (but I didn't actually check). And at
> > worst, we can make the PCI _bridges_ know about async suspends, and solve
> > it that way - without actually making any normal PCI drivers do it.
>
> This sounds suspiciously like pushing the problem up a level and
> hoping it will go away. (Sometimes that even works.)
The "we don't suspend bridges anyway" is definitely a "hoping it will go
away" issue. I think we did suspend bridges for a short while during the
PM switch-over some time ago, and it worked most of the time, and then on
some machines it just didn't work at all. Probably because ACPI ends up
touching registers behind bridges that we closed down etc.
So PCI bridges are kind of special. Right now we don't touch them, and if
we ever do, that will be another issue.
> In the end it isn't a very big issue. Using one vs. two passes in
> dpm_suspend() is pretty unimportant.
I also suspect that even if you do the USB host controller suspend
synchronously, doing the actual USB devices asynchronously would still
help - even if it's only an "asynchronously per bus" thing.
So in fact, it's probably a good first step to start off doing only the
USB devices, not the controller.
Linus
> On Mon, 7 Dec 2009, Alan Stern wrote:
> >
> > It only seems that way because you didn't take into account devices
> > that suspend synchronously but whose children suspend asynchronously.
>
> But why would I care? If somebody suspends synchronously, then that's what
> he wants.
It doesn't mean he wants to block unrelated devices from suspending
asynchronously, merely because they happen to come earlier in the list.
> > A synchronous suspend routine for a device with async child suspends
> > would have to look just like your usb_node_suspend():
>
> Sure. But that sounds like a "Doctor, it hurts when I do this" situation.
> Don't do that.
>
> Make the USB host controller do its suspend asynchronously. We don't
> suspend PCI bridges anyway, iirc (but I didn't actually check). And at
> worst, we can make the PCI _bridges_ know about async suspends, and solve
> it that way - without actually making any normal PCI drivers do it.
This sounds suspiciously like pushing the problem up a level and
hoping it will go away. (Sometimes that even works.)
In the end it isn't a very big issue. Using one vs. two passes in
dpm_suspend() is pretty unimportant.
Alan Stern
P.S.: In fact I planned all along to handle USB host controllers
asynchronously anyway, since their resume routines contain some long
delays. I was merely using them as an example.
BTW, I still don't quite understand why not to put the parent's down_write
operation into the core. It's not going to hurt for the "synchronous" devices
and the "asynchronous" ones will need to do it anyway.
Also it looks like that's something to do unconditionally for all nodes
having children, because the parent need not know if the children do async
operations.
Rafael
On Mon, 7 Dec 2009, Rafael J. Wysocki wrote:
>
> BTW, I still don't quite understand why not to put the parent's down_write
> operation into the core. It's not going to hurt for the "synchronous" devices
> and the "asynchronous" ones will need to do it anyway.
That's what I started out doing (see the first pseudo-code with the two
phases). But it _does_ actually hurt.
Because it will hurt exactly for the "multiple hubs" case: if you have two
USB hubs in parallel (and the case that Alan pointed out about a USB host
bridge is the exact same deal), then you want to be able to suspend and
resume those two independent hubs in parallel too.
But if you do the "down_write()" synchronously in the core, that means
that you are also stopping the whole "traverse the tree" thing - so now
you aren't handling the hubs in parallel even if you are handling all the
devices _behind_ them asynchronously.
This "serialize while traversing the tree" was what I was initially trying
to avoid with the two-phase approach, but that I realized (after writing
the resume path) that I could avoid much better by just moving the parents
down_write into the asynchronous path.
> Also it looks like that's something to do unconditionally for all nodes
> having children, because the parent need not know if the children do async
> operations.
True, and that was (again) the first iteration. But see above: in order to
allow way more concurrency, you don't want to introduce the false
dependency between the write-lock and the traversal of the tree (or, as
Alan points out - just a list - but that doesn't really change anything)
that is introduced by taking the lock synchronously.
So by moving the write-lock to the asynchronous work that also shuts down
the parent, you avoid that whole unnecessary serialization. But that means
that you can't do the lock in generic code.
Unless you want to do _all_ of the async logic in generic code and
re-introduce the "dev->async_suspend" flag. I would be ok with that now
that the infrastructure seems so simple.
Linus
> I also suspect that even if you do the USB host controller suspend
> synchronously, doing the actual USB devices asynchronously would still
> help - even if it's only an "asynchronously per bus" thing.
>
> So in fact, it's probably a good first step to start off doing only the
> USB devices, not the controller.
Interesting you should say that. The patch I asked Arjan to test
involved not suspending USB devices at all (root hubs being the
exception). That is in fact just what we do when CONFIG_USB_SUSPEND
isn't set.
There's no need to suspend the individual devices when the whole system
is going down. They will automatically suspend when the controller
stops sending out SOF packets, which occurs when the root hub is
suspended. The USB spec describes this, grandiosely, as a "global
suspend".
But yes, I agree. Doing just the USB devices is a good first step.
Alan Stern
On Mon, 7 Dec 2009, Alan Stern wrote:
>
> There's no need to suspend the individual devices when the whole system
> is going down. They will automatically suspend when the controller
> stops sending out SOF packets, which occurs when the root hub is
> suspended. The USB spec describes this, grandiosely, as a "global
> suspend".
Ahh, but the sync vs async would then still matter on resume. No?
Linus
>
>
> On Mon, 7 Dec 2009, Alan Stern wrote:
> >
> > There's no need to suspend the individual devices when the whole system
> > is going down. They will automatically suspend when the controller
> > stops sending out SOF packets, which occurs when the root hub is
> > suspended. The USB spec describes this, grandiosely, as a "global
> > suspend".
>
> Ahh, but the sync vs async would then still matter on resume. No?
That's complicated. If we assume the devices weren't runtime-suspended
before the sleep began, then they would automatically resume themselves
when the controller started transmitting SOF packets again. So in that case
resume would be fast and async wouldn't matter.
But if the devices were runtime-suspended, then what? The safest
course is to resume them during the system-wide resume. In that case
yes, the sync vs async would matter.
And if (as happens on many machines) the firmware messes up the
controller settings during resume, then all the USB devices would have
to be reset -- another slow procedure.
Alan Stern
Hmm. If no one calls down_read() on the "synchronous" devices, their
down_write()s will be nops. In turn, if somebody does call down_read(), it
means they really need to wait for someone. They presumably don't need
to wait for each other, but we don't really know that (otherwise they would
have been "asynchronous").
> Because it will hurt exactly for the "multiple hubs" case: if you have two
> USB hubs in parallel (and the case that Alan pointed out about a USB host
> bridge is the exact same deal), then you want to be able to suspend and
> resume those two independent hubs in parallel too.
>
> But if you do the "down_write()" synchronously in the core, that means
> that you are also stopping the whole "traverse the tree" thing - so now
> you aren't handling the hubs in parallel even if you are handling all the
> devices _behind_ them asynchronously.
>
> This "serialize while traversing the tree" was what I was initially trying
> to avoid with the two-phase approach, but that I realized (after writing
> the resume path) that I could avoid much better by just moving the parents
> down_write into the asynchronous path.
But the asynchronous path has to be started somewhere. Basically, there are
three possible places: the core itself, the bus type's suspend routine called
by the core (same goes for resume of course), and the device driver's suspend
routine called by the bus type.
Now, I don't really see how we can put the parent's down_write() in a
child's suspend routine, for multiple reasons (one of them being that there can
be multiple asynchronous children the parent needs to wait for), so it looks like
it needs to be above the driver's suspend.
However, the parent can be on a different bus type than the children, so it
looks like we can only start the asynchronous path at the core level.
> > Also it looks like that's something to do unconditionally for all nodes
> > having children, because the parent need not know if the children do async
> > operations.
>
> True, and that was (again) the first iteration. But see above: in order to
> allow way more concurrency, you don't want to introduce the false
> dependency between the write-lock and the traversal of the tree (or, as
> Alan points out - just a list - but that doesn't really change anything)
> that is introduced by taking the lock synchronously.
>
> So by moving the write-lock to the asynchronous work that also shuts down
> the parent, you avoid that whole unnecessary serialization. But that means
> that you can't do the lock in generic code.
>
> Unless you want to do _all_ of the async logic in generic code and
> re-introduce the "dev->async_suspend" flag.
Quite frankly, I would like to.
> I would be ok with that now that the infrastructure seems so simple.
Well, perhaps I should dig out my original async suspend/resume patches
that didn't contain all of the non-essential stuff and post them here for
discussion, after all ...
Rafael
> However, the parent can be on a different bus type than the children, so it
> looks like we can only start the asynchronous path at the core level.
Agreed.
> > Unless you want to do _all_ of the async logic in generic code and
> > re-introduce the "dev->async_suspend" flag.
>
> Quite frankly, I would like to.
>
> > I would be ok with that now that the infrastructure seems so simple.
>
> Well, perhaps I should dig out my original async suspend/resume patches
> that didn't contain all of the non-essential stuff and post them here for
> discussion, after all ...
That seems like a very good idea. IIRC they were quite similar to what
we have been discussing.
Alan Stern
Below is the suspend part. It contains some extra code for rolling back the
suspend if one of the asynchronous callbacks returns an error code, but apart
from that it's completely analogous to the resume part.
[This patch has only been slightly tested.]
---
drivers/base/power/main.c | 113 ++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 104 insertions(+), 9 deletions(-)
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -202,6 +202,17 @@ static void dpm_wait(struct device *dev,
wait_event(dev->power.wait_queue, !!dev->power.op_complete);
}
+static int device_pm_wait_fn(struct device *dev, void *async_ptr)
+{
+ dpm_wait(dev, *((bool *)async_ptr));
+ return 0;
+}
+
+static void dpm_wait_for_children(struct device *dev, bool async)
+{
+ device_for_each_child(dev, &async, device_pm_wait_fn);
+}
+
/**
* dpm_synchronize - Wait for PM callbacks of all devices to complete.
*/
@@ -638,6 +649,8 @@ static void dpm_complete(pm_message_t st
mutex_unlock(&dpm_list_mtx);
}
+static int async_error;
+
/**
* dpm_resume_end - Execute "resume" callbacks and complete system transition.
* @state: PM transition of the system being carried out.
@@ -685,20 +698,52 @@ static pm_message_t resume_event(pm_mess
* The driver of @dev will not receive interrupts while this function is being
* executed.
*/
-static int device_suspend_noirq(struct device *dev, pm_message_t state)
+static int __device_suspend_noirq(struct device *dev, pm_message_t state)
{
int error = 0;
- if (!dev->bus)
- return 0;
-
- if (dev->bus->pm) {
+ if (dev->bus && dev->bus->pm) {
pm_dev_dbg(dev, state, "LATE ");
error = pm_noirq_op(dev, dev->bus->pm, state);
}
+
+ dpm_finish(dev);
+
return error;
}
+static void async_suspend_noirq(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error = async_error;
+
+ if (error)
+ return;
+
+ dpm_wait_for_children(dev, true);
+ error = __device_suspend_noirq(dev, pm_transition);
+ if (error) {
+ pm_dev_err(dev, pm_transition, " async LATE", error);
+ dev->power.status = DPM_OFF;
+ }
+ put_device(dev);
+
+ if (error && !async_error)
+ async_error = error;
+}
+
+static int device_suspend_noirq(struct device *dev)
+{
+ if (dev->power.async_suspend) {
+ get_device(dev);
+ async_schedule(async_suspend_noirq, dev);
+ return 0;
+ }
+
+ dpm_wait_for_children(dev, false);
+ return __device_suspend_noirq(dev, pm_transition);
+}
+
/**
* dpm_suspend_noirq - Execute "late suspend" callbacks for non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -713,14 +758,21 @@ int dpm_suspend_noirq(pm_message_t state
suspend_device_irqs();
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
list_for_each_entry_reverse(dev, &dpm_list, power.entry) {
- error = device_suspend_noirq(dev, state);
+ dev->power.status = DPM_OFF_IRQ;
+ error = device_suspend_noirq(dev);
if (error) {
pm_dev_err(dev, state, " late", error);
+ dev->power.status = DPM_OFF;
+ break;
+ }
+ if (async_error) {
+ error = async_error;
break;
}
- dev->power.status = DPM_OFF_IRQ;
}
+ dpm_synchronize();
mutex_unlock(&dpm_list_mtx);
if (error)
dpm_resume_noirq(resume_event(state));
@@ -733,7 +785,7 @@ EXPORT_SYMBOL_GPL(dpm_suspend_noirq);
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
*/
-static int device_suspend(struct device *dev, pm_message_t state)
+static int __device_suspend(struct device *dev, pm_message_t state)
{
int error = 0;
@@ -773,10 +825,45 @@ static int device_suspend(struct device
}
End:
up(&dev->sem);
+ dpm_finish(dev);
return error;
}
+static void async_suspend(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error = async_error;
+
+ if (error)
+ goto End;
+
+ dpm_wait_for_children(dev, true);
+ error = __device_suspend(dev, pm_transition);
+ if (error) {
+ pm_dev_err(dev, pm_transition, " async", error);
+
+ dev->power.status = DPM_SUSPENDING;
+ if (!async_error)
+ async_error = error;
+ }
+
+ End:
+ put_device(dev);
+}
+
+static int device_suspend(struct device *dev, pm_message_t state)
+{
+ if (dev->power.async_suspend) {
+ get_device(dev);
+ async_schedule(async_suspend, dev);
+ return 0;
+ }
+
+ dpm_wait_for_children(dev, false);
+ return __device_suspend(dev, pm_transition);
+}
+
/**
* dpm_suspend - Execute "suspend" callbacks for all non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -788,10 +875,12 @@ static int dpm_suspend(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.prev);
get_device(dev);
+ dev->power.status = DPM_OFF;
mutex_unlock(&dpm_list_mtx);
error = device_suspend(dev, state);
@@ -799,16 +888,21 @@ static int dpm_suspend(pm_message_t stat
mutex_lock(&dpm_list_mtx);
if (error) {
pm_dev_err(dev, state, "", error);
+ dev->power.status = DPM_SUSPENDING;
put_device(dev);
break;
}
- dev->power.status = DPM_OFF;
if (!list_empty(&dev->power.entry))
list_move(&dev->power.entry, &list);
put_device(dev);
+ if (async_error)
+ break;
}
list_splice(&list, dpm_list.prev);
+ dpm_synchronize();
mutex_unlock(&dpm_list_mtx);
+ if (!error)
+ error = async_error;
return error;
}
@@ -867,6 +961,7 @@ static int dpm_prepare(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
transition_started = true;
+ async_error = 0;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.next);
There you go.
Below is the resume part. I have reworked the original patch a bit so that
it's even simpler. I'll post the suspend part in a reply to this message.
The idea is basically that if a device has the power.async_suspend flag set,
we schedule the execution of its resume callback asynchronously, but we
wait for the device's parent to finish resuming before the device's own
resume is actually executed.
The wait queue plus the op_complete flag combo plays the role of the locking
in Linus' picture, and it's essentially equivalent, since the devices being
waited for during resume will have to wait during suspend. So, for example, if
A has to wait for B during suspend, then B will have to wait for A during
resume (thus they both need to know in advance who's going to wait for them
and whom they need to wait for).
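[The op_complete flag plus wait queue combo can be mimicked in userspace with
a mutex/condvar pair standing in for the kernel wait queue -- a hedged sketch
of the mechanism, not the patch itself:]

```c
#include <pthread.h>
#include <unistd.h>

/* Userspace model of the patch's dpm_wait()/dpm_finish() pair:
 * op_complete guarded by a mutex, a condvar playing the wait queue. */
struct pm_dev {
	int op_complete;
	pthread_mutex_t mtx;
	pthread_cond_t wq;
};

static void dpm_finish_model(struct pm_dev *dev)
{
	pthread_mutex_lock(&dev->mtx);
	dev->op_complete = 1;
	pthread_cond_broadcast(&dev->wq);	/* wake_up_all() */
	pthread_mutex_unlock(&dev->mtx);
}

static void dpm_wait_model(struct pm_dev *dev)
{
	pthread_mutex_lock(&dev->mtx);
	while (!dev->op_complete)		/* wait_event() */
		pthread_cond_wait(&dev->wq, &dev->mtx);
	pthread_mutex_unlock(&dev->mtx);
}

static void *parent_resume(void *arg)
{
	usleep(50 * 1000);			/* the parent's resume work */
	dpm_finish_model(arg);
	return NULL;
}

int demo_wait(void)
{
	struct pm_dev parent = { 0, PTHREAD_MUTEX_INITIALIZER,
				 PTHREAD_COND_INITIALIZER };
	pthread_t t;

	pthread_create(&t, NULL, parent_resume, &parent);
	dpm_wait_model(&parent);	/* child blocks until parent is done */
	pthread_join(t, NULL);
	return parent.op_complete;
}
```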
Of course, the code in this patch has the problem that if there are two
"asynchronous" devices in dpm_list separated by a series of "synchronous"
devices, then they usually won't be resumed in parallel (which is what we
ultimately want). That can be optimised in a couple of ways, but such
optimisations would add quite a bit of detail to the code, so let's just omit
them for now.
BTW, thanks to the discussion with Linus I've realized that the off-tree
dependencies may be (relatively easily) taken into account by making the
interested drivers directly execute dpm_wait() for the extra devices they
need to wait for, so the entire PM links thing is simply unnecessary. So it
looks like the only things this patch is missing are the optimisations
mentioned above.
[This version of the patch has only been slightly tested.]
---
drivers/base/power/main.c | 129 +++++++++++++++++++++++++++++++++++++++----
include/linux/device.h | 6 ++
include/linux/pm.h | 4 +
include/linux/resume-trace.h | 7 ++
4 files changed, 134 insertions(+), 12 deletions(-)
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -412,15 +412,17 @@ struct dev_pm_info {
pm_message_t power_state;
unsigned int can_wakeup:1;
unsigned int should_wakeup:1;
+ unsigned async_suspend:1;
enum dpm_state status; /* Owned by the PM core */
+ wait_queue_head_t wait_queue;
#ifdef CONFIG_PM_SLEEP
struct list_head entry;
+ unsigned int op_complete:1;
#endif
#ifdef CONFIG_PM_RUNTIME
struct timer_list suspend_timer;
unsigned long timer_expires;
struct work_struct work;
- wait_queue_head_t wait_queue;
spinlock_t lock;
atomic_t usage_count;
atomic_t child_count;
Index: linux-2.6/include/linux/device.h
===================================================================
--- linux-2.6.orig/include/linux/device.h
+++ linux-2.6/include/linux/device.h
@@ -472,6 +472,12 @@ static inline int device_is_registered(s
return dev->kobj.state_in_sysfs;
}
+static inline void device_enable_async_suspend(struct device *dev, bool enable)
+{
+ if (dev->power.status == DPM_ON)
+ dev->power.async_suspend = enable;
+}
+
void driver_init(void);
/*
Index: linux-2.6/drivers/base/power/main.c
===================================================================
--- linux-2.6.orig/drivers/base/power/main.c
+++ linux-2.6/drivers/base/power/main.c
@@ -25,6 +25,7 @@
#include <linux/resume-trace.h>
#include <linux/rwsem.h>
#include <linux/interrupt.h>
+#include <linux/async.h>
#include "../base.h"
#include "power.h"
@@ -42,6 +43,7 @@
LIST_HEAD(dpm_list);
static DEFINE_MUTEX(dpm_list_mtx);
+static pm_message_t pm_transition;
/*
* Set once the preparation of devices for a PM transition has started, reset
@@ -56,6 +58,7 @@ static bool transition_started;
void device_pm_init(struct device *dev)
{
dev->power.status = DPM_ON;
+ init_waitqueue_head(&dev->power.wait_queue);
pm_runtime_init(dev);
}
@@ -162,6 +165,56 @@ void device_pm_move_last(struct device *
}
/**
+ * dpm_reset - Clear op_complete for given device.
+ * @dev: Device to handle.
+ */
+static void dpm_reset(struct device *dev)
+{
+ dev->power.op_complete = false;
+}
+
+/**
+ * dpm_finish - Set op_complete for a device and wake up threads waiting for it.
+ */
+static void dpm_finish(struct device *dev)
+{
+ dev->power.op_complete = true;
+ wake_up_all(&dev->power.wait_queue);
+}
+
+/**
+ * dpm_wait - Wait for a PM operation to complete.
+ * @dev: Device to wait for.
+ * @async: If true, ignore the device's async_suspend flag.
+ *
+ * Wait for a PM operation carried out for @dev to complete, unless @dev has to
+ * be handled synchronously and @async is false.
+ */
+static void dpm_wait(struct device *dev, bool async)
+{
+ if (!dev)
+ return;
+
+ if (!(async || dev->power.async_suspend))
+ return;
+
+ if (!dev->power.op_complete)
+ wait_event(dev->power.wait_queue, !!dev->power.op_complete);
+}
+
+/**
+ * dpm_synchronize - Wait for PM callbacks of all devices to complete.
+ */
+static void dpm_synchronize(void)
+{
+ struct device *dev;
+
+ async_synchronize_full();
+ list_for_each_entry(dev, &dpm_list, power.entry)
+ dpm_reset(dev);
+}
+
+/**
* pm_op - Execute the PM operation appropriate for given PM event.
* @dev: Device to handle.
* @ops: PM operations to choose from.
@@ -334,25 +387,48 @@ static void pm_dev_err(struct device *de
* The driver of @dev will not receive interrupts while this function is being
* executed.
*/
-static int device_resume_noirq(struct device *dev, pm_message_t state)
+static int __device_resume_noirq(struct device *dev, pm_message_t state)
{
int error = 0;
TRACE_DEVICE(dev);
TRACE_RESUME(0);
- if (!dev->bus)
- goto End;
-
- if (dev->bus->pm) {
+ if (dev->bus && dev->bus->pm) {
pm_dev_dbg(dev, state, "EARLY ");
error = pm_noirq_op(dev, dev->bus->pm, state);
}
- End:
+
+ dpm_finish(dev);
+
TRACE_RESUME(error);
return error;
}
+static void async_resume_noirq(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error;
+
+ dpm_wait(dev->parent, true);
+ error = __device_resume_noirq(dev, pm_transition);
+ if (error)
+ pm_dev_err(dev, pm_transition, " async EARLY", error);
+ put_device(dev);
+}
+
+static int device_resume_noirq(struct device *dev)
+{
+ if (dev->power.async_suspend && !pm_trace_is_enabled()) {
+ get_device(dev);
+ async_schedule(async_resume_noirq, dev);
+ return 0;
+ }
+
+ dpm_wait(dev->parent, false);
+ return __device_resume_noirq(dev, pm_transition);
+}
+
/**
* dpm_resume_noirq - Execute "early resume" callbacks for non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -366,26 +442,28 @@ void dpm_resume_noirq(pm_message_t state
mutex_lock(&dpm_list_mtx);
transition_started = false;
+ pm_transition = state;
list_for_each_entry(dev, &dpm_list, power.entry)
if (dev->power.status > DPM_OFF) {
int error;
dev->power.status = DPM_OFF;
- error = device_resume_noirq(dev, state);
+ error = device_resume_noirq(dev);
if (error)
pm_dev_err(dev, state, " early", error);
}
+ dpm_synchronize();
mutex_unlock(&dpm_list_mtx);
resume_device_irqs();
}
EXPORT_SYMBOL_GPL(dpm_resume_noirq);
/**
- * device_resume - Execute "resume" callbacks for given device.
+ * __device_resume - Execute "resume" callbacks for given device.
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
*/
-static int device_resume(struct device *dev, pm_message_t state)
+static int __device_resume(struct device *dev, pm_message_t state)
{
int error = 0;
@@ -426,11 +504,36 @@ static int device_resume(struct device *
}
End:
up(&dev->sem);
+ dpm_finish(dev);
TRACE_RESUME(error);
return error;
}
+static void async_resume(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error;
+
+ dpm_wait(dev->parent, true);
+ error = __device_resume(dev, pm_transition);
+ if (error)
+ pm_dev_err(dev, pm_transition, " async", error);
+ put_device(dev);
+}
+
+static int device_resume(struct device *dev)
+{
+ if (dev->power.async_suspend && !pm_trace_is_enabled()) {
+ get_device(dev);
+ async_schedule(async_resume, dev);
+ return 0;
+ }
+
+ dpm_wait(dev->parent, false);
+ return __device_resume(dev, pm_transition);
+}
+
/**
* dpm_resume - Execute "resume" callbacks for non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -444,6 +547,7 @@ static void dpm_resume(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.next);
@@ -454,7 +558,7 @@ static void dpm_resume(pm_message_t stat
dev->power.status = DPM_RESUMING;
mutex_unlock(&dpm_list_mtx);
- error = device_resume(dev, state);
+ error = device_resume(dev);
mutex_lock(&dpm_list_mtx);
if (error)
@@ -468,6 +572,7 @@ static void dpm_resume(pm_message_t stat
put_device(dev);
}
list_splice(&list, &dpm_list);
+ dpm_synchronize();
mutex_unlock(&dpm_list_mtx);
}
@@ -793,8 +898,10 @@ static int dpm_prepare(pm_message_t stat
break;
}
dev->power.status = DPM_SUSPENDING;
- if (!list_empty(&dev->power.entry))
+ if (!list_empty(&dev->power.entry)) {
list_move_tail(&dev->power.entry, &list);
+ dpm_reset(dev);
+ }
put_device(dev);
}
list_splice(&list, &dpm_list);
Index: linux-2.6/include/linux/resume-trace.h
===================================================================
--- linux-2.6.orig/include/linux/resume-trace.h
+++ linux-2.6/include/linux/resume-trace.h
@@ -6,6 +6,11 @@
extern int pm_trace_enabled;
+static inline int pm_trace_is_enabled(void)
+{
+ return pm_trace_enabled;
+}
+
struct device;
extern void set_trace_device(struct device *);
extern void generate_resume_trace(const void *tracedata, unsigned int user);
@@ -17,6 +22,8 @@ extern void generate_resume_trace(const
#else
+static inline int pm_trace_is_enabled(void) { return 0; }
+
#define TRACE_DEVICE(dev) do { } while (0)
#define TRACE_RESUME(dev) do { } while (0)
On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
>
> The wait queue plus the op_complete flag combo plays the role of the locking
> in Linus's picture
Please just use the lock. Don't make up your own locking crap. Really.
Your patch is horrible. Exactly because your locking is horribly
mis-designed. You can't say things are complete from an interrupt, for
example, since you made it some random bitfield, which has unknown
characteristics (ie non-atomic read-modify-write etc).
The fact is, any time anybody makes up a new locking mechanism, THEY
ALWAYS GET IT WRONG. Don't do it.
I suggested using the rwsem locking for a good reason. It made sense. It
was simpler. Just do it that way, stop making up crap.
Linus
> Please just use the lock. Don't make up your own locking crap. Really.
>
> Your patch is horrible. Exactly because your locking is horribly
> mis-designed. You can't say things are complete from an interrupt, for
> example, since you made it some random bitfield, which has unknown
> characteristics (ie non-atomic read-modify-write etc).
>
> The fact is, any time anybody makes up a new locking mechanism, THEY
> ALWAYS GET IT WRONG. Don't do it.
>
> I suggested using the rwsem locking for a good reason. It made sense. It
> was simpler. Just do it that way, stop making up crap.
The semantics needed for this kind of lock aren't really the same as
for an rwsem (although obviously an rwsem will do the job). Basically
it needs to have the capability for multiple users to lock it (no
blocking when acquiring a lock) and the capability for a user to wait
until it is totally unlocked. It could be implemented trivially using
an atomic_t counter and a waitqueue head.
Is this a standard sort of lock? It's a lot simpler than most others.
I don't recall seeing anything quite like it anywhere; the closest
thing might be some kind of barrier.
Alan Stern
On Tue, 8 Dec 2009, Alan Stern wrote:
>
> The semantics needed for this kind of lock aren't really the same as
> for an rwsem (although obviously an rwsem will do the job). Basically
> it needs to have the capability for multiple users to lock it (no
> blocking when acquiring a lock) and the capability for a user to wait
> until it is totally unlocked. It could be implemented trivially using
> an atomic_t counter and a waitqueue head.
>
> Is this a standard sort of lock?
Yes it is.
It's called a rwlock. The counter is for readers, the exclusion is for
writers.
Really.
And the thing is, you actually do want the rwlock semantics, because on
the resume side you want the parent to lock it for writing first (so that
the children can wait for the parent to have completed its resume).
So we actually _want_ the full rwlock semantics.
See the code I posted earlier. Here condensed into one email:
- resume:

    usb_node_resume(node)
    {
        // Wait for parent to finish resume
        down_read(node->parent->lock);
        // .. before resuming ourselves
        node->resume(node);

        // Now we're all done
        up_read(node->parent->lock);
        up_write(node->lock);
    }

    /* caller: */
    ..
    // This won't block, because we resume parents before children,
    // and the children will take the read lock.
    down_write(leaf->lock);
    // Do the blocking part asynchronously
    async_schedule(usb_node_resume, leaf);
    ..
- suspend:

    usb_node_suspend(node)
    {
        // Start our suspend. This will block if we have any
        // children that are still busy suspending (they will
        // have done a down_read() in their suspend).
        down_write(node->lock);
        node->suspend(node);
        up_write(node->lock);

        // This lets our parent continue
        up_read(node->parent->lock);
    }

    /* caller: */

    // This won't block, because we suspend nodes before parents
    down_read(node->parent->lock);
    // Do the part that may block asynchronously
    async_schedule(do_usb_node_suspend, node);
It really should be that simple. Nothing more, nothing less. And with the
above, finishing the suspend (or resume) from interrupts is fine, and you
don't have any new lock that has undefined memory ordering issues etc.
Linus
> On Tue, 8 Dec 2009, Alan Stern wrote:
> >
> > The semantics needed for this kind of lock aren't really the same as
> > for an rwsem (although obviously an rwsem will do the job). Basically
> > it needs to have the capability for multiple users to lock it (no
> > blocking when acquiring a lock) and the capability for a user to wait
> > until it is totally unlocked. It could be implemented trivially using
> > an atomic_t counter and a waitqueue head.
> >
> > Is this a standard sort of lock?
>
> Yes it is.
>
> It's called a rwlock. The counter is for readers, the exclusion is for
> writers.
>
> Really.
>
> And the thing is, you actually do want the rwlock semantics, because on
> the resume side you want the parent to lock it for writing first (so that
> the children can wait for the parent to have completed its resume).
>
> So we actually _want_ the full rwlock semantics.
I'm not convinced. Condense the description a little further:
Suspend: Children lock the parent first. When they are
finished they unlock the parent, allowing it to
proceed.
Resume: Parent locks itself first. When it is finished
it unlocks itself, allowing the children to proceed.
The whole readers vs. writers thing is a non-sequitur. (For instance,
this never uses the fact that writers exclude each other.) In each
case a lock is taken and eventually released, allowing someone else to
stop waiting and move forward. In the suspend case we have multiple
lockers and one waiter, whereas in the resume case we have one locker
and multiple waiters.
The simplest generalization is to allow both multiple lockers and
multiple waiters. Call it a waitlock, for want of a better name:
    wait_lock(wl)
    {
        atomic_inc(&wl->count);
    }

    wait_unlock(wl)
    {
        if (atomic_dec_and_test(&wl->count)) {
            smp_mb__after_atomic_dec();
            wake_up_all(wl->wqh);
        }
    }

    wait_for_lock(wl)
    {
        wait_event(wl->wqh, atomic_read(&wl->count) == 0);
        smp_rmb();
    }
Note that both wait_lock() and wait_unlock() can be called
in_interrupt.
> See the code I posted earlier. Here condensed into one email:
>
> - resume:
>
> usb_node_resume(node)
> {
> // Wait for parent to finish resume
> down_read(node->parent->lock);
> // .. before resuming ourselves
> node->resume(node);
>
> // Now we're all done
> up_read(node->parent->lock);
> up_write(node->lock);
> }
>
> /* caller: */
> ..
> // This won't block, because we resume parents before children,
> // and the children will take the read lock.
> down_write(leaf->lock);
> // Do the blocking part asynchronously
> async_schedule(usb_node_resume, leaf);
> ..
This becomes:

    usb_node_resume(node)
    {
        // Wait for parent to finish resume
        wait_for_lock(node->parent->lock);
        // .. before resuming ourselves
        node->resume(node);

        // Now we're all done
        wait_unlock(node->lock);
    }

    /* caller: */
    ..
    // This can't block, because wait_lock() is non-blocking.
    wait_lock(node->lock);
    // Do the blocking part asynchronously
    async_schedule(usb_node_resume, leaf);
    ..
> - suspend:
>
> usb_node_suspend(node)
> {
> // Start our suspend. This will block if we have any
> // children that are still busy suspending (they will
> // have done a down_read() in their suspend).
> down_write(node->lock);
> node->suspend(node);
> up_write(node->lock);
>
> // This lets our parent continue
> up_read(node->parent->lock);
> }
>
> /* caller: */
>
> // This won't block, because we suspend nodes before parents
> down_read(node->parent->lock);
> // Do the part that may block asynchronously
> async_schedule(do_usb_node_suspend, node);
    usb_node_suspend(node)
    {
        // Start our suspend. This will block if we have any
        // children that are still busy suspending (they will
        // have done a wait_lock() in their suspend).
        wait_for_lock(node->lock);
        node->suspend(node);

        // This lets our parent continue
        wait_unlock(node->parent->lock);
    }

    /* caller: */
    ..
    // This can't block, because wait_lock is non-blocking.
    wait_lock(node->parent->lock);
    // Do the part that may block asynchronously
    async_schedule(do_usb_node_suspend, node);
    ..
> It really should be that simple. Nothing more, nothing less. And with the
> above, finishing the suspend (or resume) from interrupts is fine, and you
> don't have any new lock that has undefined memory ordering issues etc.
Aren't waitlocks simpler than rwsems? Not as generally useful,
perhaps. But just as correct in this situation.
Alan Stern
On Tue, 8 Dec 2009, Alan Stern wrote:
> >
> > So we actually _want_ the full rwlock semantics.
>
> I'm not convinced. Condense the description a little further:
>
> Suspend: Children lock the parent first. When they are
> finished they unlock the parent, allowing it to
> proceed.
>
> Resume: Parent locks itself first. When it is finished
> it unlocks itself, allowing the children to proceed.
Yes. You can implement it with a simple lock with a count. Nobody debates
that.
But a simple counting lock _is_ a rwlock. Really. They are 100%
semantically equivalent. There is no difference.
> The whole readers vs. writers thing is a non-sequitur.
No it's not.
It's a 100% equivalent problem. It's purely a change of wording. The end
result is the same.
> The simplest generalization is to allow both multiple lockers and
> multiple waiters. Call it a waitlock, for want of a better name:
But we have that. It _has_ a better name: rwlocks.
And the reason the name is better is because now the name describes all
the semantics to anybody who has ever taken a course in operating systems
or in parallelism.
It's also a better implementation, because it actually _works_.
> wait_lock(wl)
> {
> atomic_inc(&wl->count);
> }
>
> wait_unlock(wl)
> {
> if (atomic_dec_and_test(&wl->count)) {
> smp_mb__after_atomic_dec();
> wake_up_all(wl->wqh);
> }
> }
>
> wait_for_lock(wl)
> {
> wait_event(wl->wqh, atomic_read(&wl->count) == 0);
> smp_rmb();
> }
>
> Note that both wait_lock() and wait_unlock() can be called
> in_interrupt.
And note how even though you sprinkled random memory barriers around, you
still got it wrong.
So you just implemented a buggy lock, and for what gain? Tell me exactly
why your buggy lock (assuming you'd know enough about memory ordering to
actually fix it) is better than just using the existing one?
It's certainly not smaller. It's not faster. It doesn't have support for
lockdep. And it's BUGGY.
Really. Tell me why you want to re-implement an existing lock - badly.
[ Hint: you need a smp_mb() *before* the atomic_dec() in wait-unlock, so
that anybody else who sees the new value will be guaranteed to have seen
anything else the unlocker did.
You also need a smp_mb() in the wait_for_lock(), not a smp_rmb(). Can't
allow writes to migrate up either. 'atomic_read()' does not imply any
barriers.
But most architectures can optimize these things for their particular
memory ordering model, and do so in their rwsem implementation. ]
> This becomes:
>
> usb_node_resume(node)
> {
> // Wait for parent to finish resume
> wait_for_lock(node->parent->lock);
> // .. before resuming ourselves
> node->resume(node);
>
> // Now we're all done
> wait_unlock(node->lock);
> }
>
> /* caller: */
> ..
> // This can't block, because wait_lock() is non-blocking.
> wait_lock(node->lock);
> // Do the blocking part asynchronously
> async_schedule(usb_node_resume, leaf);
> ..
Umm? Same thing, different words?
That "wait_for_lock()" is equivalent to a 'read_lock()+read_unlock()'. We
_could_ expose such a mechanism for rwsem's too, but why do it? It's
actually nicer to use a real read-lock - and do it _around_ the operation,
because now the locking also automatically gets things like overlapping
suspends and resumes right.
(Which you'd obviously hope never happens, but it's nice from a conceptual
standpoint to know that the locking is robust).
> Aren't waitlocks simpler than rwsems? Not as generally useful,
> perhaps. But just as correct in this situation.
NO!
Dammit. I started this whole rant with this comment to Rafael:
"The fact is, any time anybody makes up a new locking mechanism, THEY
ALWAYS GET IT WRONG. Don't do it."
Take heed. You got it wrong. Admit it. Locking is _hard_. SMP memory
ordering is HARD.
So leave locking to the pro's. They _also_ got it wrong, but they got it
wrong several years ago, and fixed up their sh*t.
This is why you use generic locking. ALWAYS.
Linus
On Tue, 8 Dec 2009, Linus Torvalds wrote:
>
> [ Hint: you need a smp_mb() *before* the atomic_dec() in wait-unlock, so
> that anybody else who sees the new value will be guaranteed to have seen
> anything else the unlocker did.
>
> You also need a smp_mb() in the wait_for_lock(), not a smp_rmb(). Can't
> allow writes to migrate up either. 'atomic_read()' does not imply any
> barriers.
>
> But most architectures can optimize these things for their particular
> memory ordering model, and do so in their rwsem implementation. ]
Side note: if this was a real lock, you'd also needed an smp_wmb() in the
'wait_lock()' path after the atomic_inc(), to make sure that others see
the atomic lock was seen by other people before the suspend started.
In your usage scenario, I don't think it would ever be noticeable, since
the other users are always going to start running from the same thread
that did the wait_lock(), so even if they run on other CPU's, we'll have
scheduled _to_ those other CPU's and done enough memory ordering to
guarantee that they will see the thing.
So it would be ok in this situation, simply because it acts as an
initializer and never sees any real SMP issues.
But it's an example of how you now don't just depend on the locking
primitives themselves doing the right thing, you end up depending very
subtly on exactly how the lock is used. The standard locks do have the
same kind of issue for initializers, but we avoid it elsewhere because
it's so risky.
> > The whole readers vs. writers thing is a non-sequitur.
>
> No it's not.
>
> It's a 100% equivalent problem. It's purely a change of wording. The end
> result is the same.
Well, of course the end result is the same (ignoring bugs) -- that was
the point. It doesn't follow that the two locking mechanisms are 100%
equivalent.
> And note how even though you sprinkled random memory barriers around, you
> still got it wrong.
Yes. That comes of trying to think at the keyboard.
> It's certainly not smaller. It's not faster. It doesn't have support for
> lockdep. And it's BUGGY.
Lockdep will choke on the rwsem approach anyway. It has never been
very good at handling tree-structured locking, especially when there
are non-parent-child interactions. But never mind.
> Really. Tell me why you want to re-implement an existing lock - badly.
I didn't want to. The whole exercise was intended to make a point --
that rwsems do more than we really need here.
> [ Hint: you need a smp_mb() *before* the atomic_dec() in wait-unlock, so
> that anybody else who sees the new value will be guaranteed to have seen
> anything else the unlocker did.
Yes.
> You also need a smp_mb() in the wait_for_lock(), not a smp_rmb(). Can't
> allow writes to migrate up either. 'atomic_read()' does not imply any
> barriers.
No, that's not needed. Unlike reads, writes can't move in front of
data or control dependencies. Or so I've been led to believe...
> That "wait_for_lock()" is equivalent to a 'read_lock()+read_unlock()'.
Not really. It also corresponds to a 'write_lock()+write_unlock()' (in
the suspend routine). Are you claiming these two compound operations
are equivalent?
> We
> _could_ expose such a mechanism for rwsem's too, but why do it? It's
> actually nicer to use a real read-lock - and do it _around_ the operation,
> because now the locking also automatically gets things like overlapping
> suspends and resumes right.
>
> (Which you'd obviously hope never happens, but it's nice from a conceptual
> standpoint to know that the locking is robust).
> Take heed. You got it wrong. Admit it. Locking is _hard_. SMP memory
> ordering is HARD.
Oh, there's no question about that. I never seriously intended this
stuff to be adopted. It was just for discussion.
Alan Stern
> Side note: if this was a real lock, you'd also needed an smp_wmb() in the
> 'wait_lock()' path after the atomic_inc(), to make sure that others see
> the atomic lock was seen by other people before the suspend started.
>
> In your usage scenario, I don't think it would ever be noticeable, since
> the other users are always going to start running from the same thread
> that did the wait_lock(), so even if they run on other CPU's, we'll have
> scheduled _to_ those other CPU's and done enough memory ordering to
> guarantee that they will see the thing.
>
> So it would be ok in this situation, simply because it acts as an
> initializer and never sees any real SMP issues.
Yes. I would have brought this up, but you made the point for me.
> But it's an example of how you now don't just depend on the locking
> primitives themselves doing the right thing, you end up depending very
> subtly on exactly how the lock is used. The standard locks do have the
> same kind of issue for initializers, but we avoid it elsewhere because
> it's so risky.
No doubt there are other reasons why the "wait-lock" pattern doesn't
get used enough to be noticed.
Alan Stern
I didn't assume anyone would check it from an interrupt, because I didn't see
a point. In fact I didn't assume anyone except for the PM core would check it.
In case this assumption is wrong, it can be easily put under the dev->sem
that we take anyway before calling the bus type (etc.) callbacks.
Anyway, if we use an rwsem, it won't be checkable from interrupt context
either.
> The fact is, any time anybody makes up a new locking mechanism, THEY
> ALWAYS GET IT WRONG. Don't do it.
>
> I suggested using the rwsem locking for a good reason. It made sense. It
> was simpler. Just do it that way, stop making up crap.
Suppose we use rwsem and during suspend each child uses a down_read() on a
parent and then the parent uses down_write() on itself. What if, whatever the
reason, the parent is a bit early and does the down_write() before one of the
children has a chance to do the down_read()? Aren't we toast?
Do we need any direct protection against that or does it just work itself out
in a way I just don't see right now?
Rafael
> Suppose we use rwsem and during suspend each child uses a down_read() on a
> parent and then the parent uses down_write() on itself. What if, whatever the
> reason, the parent is a bit early and does the down_write() before one of the
> children has a chance to do the down_read()? Aren't we toast?
>
> Do we need any direct protection against that or does it just work itself out
> in a way I just don't see right now?
That's not the way it should be done. Linus had children taking their
parents' locks during suspend, which is simple but leads to
difficulties.
Instead, the PM core should do a down_write() on each device before
starting the device's async suspend routine, and an up_write() when the
routine finishes. Parents should, at the start of their async routine,
do down_read() on each of their children plus whatever other devices
they need to wait for. The core can do the waiting for children part
and the driver's suspend routine can handle any other waiting.
This is a little more awkward because it requires the parent to iterate
through its children. But it does solve the off-tree dependency
problem for suspends.
Alan Stern
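In kernel terms, Alan's proposal might look roughly like the sketch below. It is only a sketch: the helper names (async_suspend, wait_for_one_child, __device_suspend) are made up, error handling is omitted, and device_for_each_child() is assumed for the iteration over children.

    /* Core, before scheduling a device's async suspend (sketch): */
    down_write(&dev->power.rwsem);
    async_schedule(async_suspend, dev);

    /* Hypothetical async routine: */
    static int wait_for_one_child(struct device *child, void *unused)
    {
        down_read(&child->power.rwsem);   /* waits until the child's */
        up_read(&child->power.rwsem);     /* async suspend released it */
        return 0;
    }

    static void async_suspend(void *data, async_cookie_t cookie)
    {
        struct device *dev = data;

        device_for_each_child(dev, NULL, wait_for_one_child);
        /* ... driver may also wait for off-tree dependencies here ... */
        __device_suspend(dev, pm_transition);
        up_write(&dev->power.rwsem);
    }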
I can live with that.
> But it does solve the off-tree dependency problem for suspends.
That's a plus, but I still think we're trying to create a barrier-like
mechanism out of locks.
There's one more possibility to consider, though. What if we use a completion
instead of the flag + wait queue? It surely is a standard synchronization
mechanism and it seems it might work here.
Rafael
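Concretely, the completion variant Rafael suggests might look like this. A sketch only: the per-device field and the dpm_wait() helper are assumptions here, and complete_all() is used because several children may wait on the same parent.

    /* In struct dev_pm_info (assumed field): */
    struct completion completion;

    /* Reinitialized for every transition, e.g. in dpm_prepare(): */
    init_completion(&dev->power.completion);

    /* Waiting for another device (the parent on resume,
     * each child on suspend): */
    static void dpm_wait(struct device *dev)
    {
        if (dev)
            wait_for_completion(&dev->power.completion);
    }

    /* When a device finishes its transition: */
    complete_all(&dev->power.completion);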
On Tue, 8 Dec 2009, Alan Stern wrote:
>
> > You also need a smp_mb() in the wait_for_lock(), not a smp_rmb(). Can't
> > allow writes to migrate up either. 'atomic_read()' does not imply any
> > barriers.
>
> No, that's not needed. Unlike reads, writes can't move in front of
> data or control dependencies. Or so I've been led to believe...
Sure they can. Control dependencies are trivial - it's called "branch
prediction", and everybody does it, and data dependencies don't exist on
many CPU architectures (even to the point of reading through a pointer
that you loaded).
But yes, on x86, stores only move down. But that's an x86-specific thing.
[ Not that it's also not very common - write buffering is easy and matters
for performance, so any in-order implementation will generally do it. In
contrast, writes moving up doesn't really help performance and is harder
to do, but can happen with a weakly ordered memory subsystem especially
if you have multi-way caches where some ways are busy and end up being
congested.
So the _common_ case is definitely about delaying writes and doing reads
early if possible. But it's not necessarily at all guaranteed in
general. ]
> > That "wait_for_lock()" is equivalent to a 'read_lock()+read_unlock()'.
>
> Not really. It also corresponds to a 'write_lock()+write_unlock()' (in
> the suspend routine). Are you claiming these two compound operations
> are equivalent?
They have separate semantics, and you just want to pick the one that suits
you. Your counting lock doesn't have the "read_lock+read_unlock" version,
it only has the write_lock/unlock one (ie it requires the thing to be
totally unlocked).
The point being, rwsem's can do everything your counting lock does. And
they already exist. And they already know about all the subtleties of
architecture-specific memory ordering etc.
Linus
> > This is a little more awkward because it requires the parent to iterate
> > through its children.
>
> I can live with that.
>
> > But it does solve the off-tree dependency problem for suspends.
>
> That's a plus, but I still think we're trying to create a barrier-like
> mechanism out of locks.
>
> There's one more possibility to consider, though. What if we use a completion
> instead of the flag + wait queue? It surely is a standard synchronization
> mechanism and it seems it might work here.
You're right. I should have thought of that. Linus's original
approach couldn't use a completion because during suspend it needed to
make one task (the parent) wait for a bunch of others (the children).
But if you iterate through the children by hand, that objection no
longer applies.
Alan Stern
BTW, is there a good reason why completion_done() doesn't use spin_lock_irqsave
and spin_unlock_irqrestore? complete() and complete_all() use them, so why not
here?
Rafael
On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
>
> Anyway, if we use an rwsem, it won't be checkable from interrupt context
> either.
You can't do a lock() from an interrupt, but the unlocks should be
irq-safe.
> Suppose we use rwsem and during suspend each child uses a down_read() on a
> parent and then the parent uses down_write() on itself. What if, whatever the
> reason, the parent is a bit early and does the down_write() before one of the
> children has a chance to do the down_read()? Aren't we toast?
We're toast, but we're toast for a totally unrelated reason: it means that
you tried to resume a child before a parent, which would be a major bug to
begin with.
Look, I even wrote out the comments, so let me repeat the code one more
time.
- suspend time calling:

    // This won't block, because we suspend nodes before parents
    down_read(node->parent->lock);
    // Do the part that may block asynchronously
    async_schedule(do_usb_node_suspend, node);

- resume time calling:

    // This won't block, because we resume parents before children,
    // and the children will take the read lock.
    down_write(leaf->lock);
    // Do the blocking part asynchronously
    async_schedule(usb_node_resume, leaf);
See? So when we take the parent lock for suspend, we are guaranteed to do
so _before_ the parent node itself suspends. And conversely, when we take
the parent lock (asynchronously) for resume, we're guaranteed to do that
_after_ the parent node has done its own down_write.
And that all depends on just one trivial thing; that the suspend and
resume is called in the right order (children first vs parent first
respectively). And that is such a _major_ correctness issue that if that
isn't correct, your suspend isn't going to work _anyway_.
Linus
On Tue, 8 Dec 2009, Alan Stern wrote:
>
> That's not the way it should be done. Linus had children taking their
> parents' locks during suspend, which is simple but leads to
> difficulties.
No it doesn't. Name them.
> Instead, the PM core should do a down_write() on each device before
> starting the device's async suspend routine, and an up_write() when the
> routine finishes.
No you should NOT do that. If you do that, you serialize the suspend
incorrectly and much too early. IOW, think of a topology like this:

    a -> b -> c
     \
      > d -> e
where you'd want to suspend 'c' and 'e' asynchronously. If we do a
'down_write()' on b, then we'll delay until 'c' has suspended, and if we
have ordered the nodes in the obvious depth-first order, we'll walk the PM
device list in the order:
c b e d a
and now we'll serialize on 'b', waiting for 'c' to suspend. Which we do
_not_ want to do, because the whole point was to suspend 'c' and 'e'
together.
> Parents should, at the start of their async routine,
> do down_read() on each of their children plus whatever other devices
> they need to wait for. The core can do the waiting for children part
> and the driver's suspend routine can handle any other waiting.
Why?
That just complicates things. Compare to my simple locking scheme I've
quoted several times.
Linus
On Tue, 8 Dec 2009, Linus Torvalds wrote:
> On Tue, 8 Dec 2009, Alan Stern wrote:
> >
> > That's not the way it should be done. Linus had children taking their
> > parents' locks during suspend, which is simple but leads to
> > difficulties.
>
> No it doesn't. Name them.
Really.
Let me put this simply: I've told you guys how to do it simply, with
_zero_ crap. No "iterating over children". No games. No data structures.
No new infrastructure. Just a single new rwlock per device, and _trivial_
code.
So here's the challenge: try it my simple way first. I've quoted the code
about five million times already. If you _actually_ see some problems,
explain them. Don't make up stupid "iterate over each child" things. Don't
claim totally made-up "leads to difficulties". Don't make it any more
complicated than it needs to be.
Keep it simple. And once you have tried that simple approach, and you
really can show why it doesn't work, THEN you can try something else.
But before you try the simple approach and explain why it wouldn't work, I
simply will not pull anything more complex. Understood and agreed?
> > No, that's not needed. Unlike reads, writes can't move in front of
> > data or control dependencies. Or so I've been led to believe...
>
> Sure they can. Control dependencies are trivial - it's called "branch
> prediction", and everybody does it, and data dependencies don't exist on
> many CPU architectures (even to the point of reading through a pointer
> that you loaded).
Wait a second. Are you saying that with code like this:
if (x == 1)
y = 5;
the CPU may write to y before it has finished reading the value of x?
And this write is visible to other CPUs, so that if x was initially 0
and a second CPU sets x to 1, the second CPU may see y == 5 before it
executes the write to x (whatever that may mean)?
Alan Stern
> BTW, is there a good reason why completion_done() doesn't use spin_lock_irqsave
> and spin_unlock_irqrestore? complete() and complete_all() use them, so why not
> here?
And likewise in try_wait_for_completion(). It looks like a bug. Maybe
these routines were not intended to be called with interrupts disabled,
but that requirement doesn't seem to be documented. And it isn't a
natural requirement anyway.
Alan Stern
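The fix both messages point at would simply mirror what complete() already does, i.e. use the irqsave variants. A sketch against the kernel/sched.c code of that era; try_wait_for_completion() would get the same treatment:

    bool completion_done(struct completion *x)
    {
        unsigned long flags;
        bool ret = true;

        spin_lock_irqsave(&x->wait.lock, flags);
        if (!x->done)
            ret = false;
        spin_unlock_irqrestore(&x->wait.lock, flags);
        return ret;
    }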
Understood (I think).
Let's try it, then. Below is the resume patch based on my previous one in this
thread (I have only verified that it builds). Is that along the lines you want?
Rafael
---
drivers/base/power/main.c | 78 ++++++++++++++++++++++++++++++++++++++-----
include/linux/device.h | 6 +++
include/linux/pm.h | 3 +
include/linux/resume-trace.h | 7 +++
4 files changed, 85 insertions(+), 9 deletions(-)
Index: linux-2.6/include/linux/pm.h
===================================================================
--- linux-2.6.orig/include/linux/pm.h
+++ linux-2.6/include/linux/pm.h
@@ -26,6 +26,7 @@
#include <linux/spinlock.h>
#include <linux/wait.h>
#include <linux/timer.h>
+#include <linux/rwsem.h>
/*
* Callbacks for platform drivers to implement.
@@ -412,9 +413,11 @@ struct dev_pm_info {
pm_message_t power_state;
unsigned int can_wakeup:1;
unsigned int should_wakeup:1;
+ unsigned async_suspend:1;
enum dpm_state status; /* Owned by the PM core */
#ifdef CONFIG_PM_SLEEP
struct list_head entry;
+ struct rw_semaphore rwsem;
#endif
#ifdef CONFIG_PM_RUNTIME
struct timer_list suspend_timer;
+ init_rwsem(&dev->power.rwsem);
pm_runtime_init(dev);
}
@@ -334,25 +337,51 @@ static void pm_dev_err(struct device *de
* The driver of @dev will not receive interrupts while this function is being
* executed.
*/
-static int device_resume_noirq(struct device *dev, pm_message_t state)
+static int __device_resume_noirq(struct device *dev, pm_message_t state)
{
int error = 0;
TRACE_DEVICE(dev);
TRACE_RESUME(0);
- if (!dev->bus)
- goto End;
+ down_read(&dev->parent->power.rwsem);
- if (dev->bus->pm) {
+ if (dev->bus && dev->bus->pm) {
pm_dev_dbg(dev, state, "EARLY ");
error = pm_noirq_op(dev, dev->bus->pm, state);
}
- End:
+
+ up_read(&dev->parent->power.rwsem);
+ up_write(&dev->power.rwsem);
+
TRACE_RESUME(error);
return error;
}
+static void async_resume_noirq(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error;
+
+ error = __device_resume_noirq(dev, pm_transition);
+ if (error)
+ pm_dev_err(dev, pm_transition, " async EARLY", error);
+ put_device(dev);
+}
+
+static int device_resume_noirq(struct device *dev)
+{
+ down_write(&dev->power.rwsem);
+
+ if (dev->power.async_suspend && !pm_trace_is_enabled()) {
+ get_device(dev);
+ async_schedule(async_resume_noirq, dev);
+ return 0;
+ }
+
+ return __device_resume_noirq(dev, pm_transition);
+}
+
/**
* dpm_resume_noirq - Execute "early resume" callbacks for non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -366,32 +395,35 @@ void dpm_resume_noirq(pm_message_t state
mutex_lock(&dpm_list_mtx);
transition_started = false;
+ pm_transition = state;
list_for_each_entry(dev, &dpm_list, power.entry)
if (dev->power.status > DPM_OFF) {
int error;
dev->power.status = DPM_OFF;
- error = device_resume_noirq(dev, state);
+ error = device_resume_noirq(dev);
if (error)
pm_dev_err(dev, state, " early", error);
}
mutex_unlock(&dpm_list_mtx);
+ async_synchronize_full();
resume_device_irqs();
}
EXPORT_SYMBOL_GPL(dpm_resume_noirq);
/**
- * device_resume - Execute "resume" callbacks for given device.
+ * __device_resume - Execute "resume" callbacks for given device.
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
*/
-static int device_resume(struct device *dev, pm_message_t state)
+static int __device_resume(struct device *dev, pm_message_t state)
{
int error = 0;
TRACE_DEVICE(dev);
TRACE_RESUME(0);
+ down_read(&dev->parent->power.rwsem);
down(&dev->sem);
if (dev->bus) {
@@ -426,11 +458,37 @@ static int device_resume(struct device *
}
End:
up(&dev->sem);
+ up_read(&dev->parent->power.rwsem);
+ up_write(&dev->power.rwsem);
TRACE_RESUME(error);
return error;
}
+static void async_resume(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error;
+
+ error = __device_resume(dev, pm_transition);
+ if (error)
+ pm_dev_err(dev, pm_transition, " async", error);
+ put_device(dev);
+}
+
+static int device_resume(struct device *dev)
+{
+ down_write(&dev->power.rwsem);
+
+ if (dev->power.async_suspend && !pm_trace_is_enabled()) {
+ get_device(dev);
+ async_schedule(async_resume, dev);
+ return 0;
+ }
+
+ return __device_resume(dev, pm_transition);
+}
+
/**
* dpm_resume - Execute "resume" callbacks for non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -444,6 +502,7 @@ static void dpm_resume(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.next);
@@ -454,7 +513,7 @@ static void dpm_resume(pm_message_t stat
dev->power.status = DPM_RESUMING;
mutex_unlock(&dpm_list_mtx);
- error = device_resume(dev, state);
+ error = device_resume(dev);
mutex_lock(&dpm_list_mtx);
if (error)
@@ -469,6 +528,7 @@ static void dpm_resume(pm_message_t stat
}
list_splice(&list, &dpm_list);
mutex_unlock(&dpm_list_mtx);
+ async_synchronize_full();
}
/**
Index: linux-2.6/include/linux/resume-trace.h
===================================================================
--- linux-2.6.orig/include/linux/resume-trace.h
+++ linux-2.6/include/linux/resume-trace.h
@@ -6,6 +6,11 @@
extern int pm_trace_enabled;
+static inline int pm_trace_is_enabled(void)
+{
+ return pm_trace_enabled;
+}
+
struct device;
extern void set_trace_device(struct device *);
extern void generate_resume_trace(const void *tracedata, unsigned int user);
@@ -17,6 +22,8 @@ extern void generate_resume_trace(const
#else
+static inline int pm_trace_is_enabled(void) { return 0; }
+
#define TRACE_DEVICE(dev) do { } while (0)
#define TRACE_RESUME(dev) do { } while (0)
No, the write really depends on x being 1 at any time before the comparison.
On the other hand, x being != 0 during the comparison does not prevent the
write without proper locking or barriers.
Have a look at
http://www.linuxjournal.com/article/8211
http://www.linuxjournal.com/article/8212
especially the Alpha part, which shows what can happen when dealing with pointer accesses.
Christian
> On Tue, 8 Dec 2009, Alan Stern wrote:
> >
> > That's not the way it should be done. Linus had children taking their
> > parents' locks during suspend, which is simple but leads to
> > difficulties.
>
> No it doesn't. Name them.
Well, one difficulty. It arises only because we are contemplating
having the PM core fire up the async tasks, rather than having the
drivers' suspend routines launch them (the way your original proposal
did -- the difficulty does not arise there).
Suppose A and B are unrelated devices and we need to impose the
off-tree constraint that A suspends after B. With children taking
their parent's lock, the way to prevent A from suspending too soon is
by having B's suspend routine acquire A's lock.
But B's suspend routine runs entirely in an async task, because that
task is started by the PM core and it does the method call. Hence by
the time B's suspend routine is called, A may already have begun
suspending -- it's too late to take A's lock. To make the locking
work, B would have to acquire A's lock _before_ B's async task starts.
Since the PM core is unaware of the off-tree dependency, there's no
simple way to make it work.
> > Instead, the PM core should do a down_write() on each device before
> > starting the device's async suspend routine, and an up_write() when the
> > routine finishes.
>
> No you should NOT do that. If you do that, you serialize the suspend
> incorrectly and much too early. IOW, think a topology like this:
>
> a -> b -> c
>  \
>   > d -> e
>
> where you'd want to suspend 'c' and 'e' asynchronously. If we do a
> 'down-write()' on b, then we'll delay until 'c' has suspended, and if we
> have ordered the nodes in the obvious depth-first order, we'll walk the PM
> device list in the order:
>
> c b e d a
>
> and now we'll serialize on 'b', waiting for 'c' to suspend. Which we do
> _not_ want to do, because the whole point was to suspend 'c' and 'e'
> together.
You misunderstand. The suspend algorithm will look like this:
dpm_suspend()
{
list_for_each_entry_reverse(dpm_list, dev) {
down_write(dev->lock);
async_schedule(device_suspend, dev);
}
}
device_suspend(dev)
{
device_for_each_child(dev, child) {
down_read(child->lock);
up_read(child->lock);
}
dev->suspend(dev); /* May do off-tree down+up pairs */
up_write(dev->lock);
}
With completions instead of rwsems, the down_write() changes to
init_completion(), the up_write() changes to complete_all(), and the
down_read()+up_read() pairs change to wait_for_completion().
So 'b' will wait for 'c' to suspend, as it must, but 'e' won't wait for
anything.
> > Parents should, at the start of their async routine,
> > do down_read() on each of their children plus whatever other devices
> > they need to wait for. The core can do the waiting for children part
> > and the driver's suspend routine can handle any other waiting.
>
> Why?
>
> That just complicates things. Compare to my simple locking scheme I've
> quoted several times.
It is a little more complicated in that it involves explicitly
iterating over children. But it is simpler in that it can use
completions instead of rwsems and it avoids the off-tree dependency
problem described above.
Alan Stern
Ah, I need to check if dev->parent is not NULL before trying to lock it, but
apart from this it doesn't break things at least.
Rafael
On Tue, 8 Dec 2009, Alan Stern wrote:
> >
> > Sure they can. Control dependencies are trivial - it's called "branch
> > prediction", and everybody does it, and data dependencies don't exist on
> > many CPU architectures (even to the point of reading through a pointer
> > that you loaded).
>
> Wait a second. Are you saying that with code like this:
>
> if (x == 1)
> y = 5;
>
> the CPU may write to y before it has finished reading the value of x?
Well, in a way. The branch may have been predicted, and the CPU can
_internally_ have done the 'y=5' thing into a write buffer before it even
did the read.
Some time later it will have to _verify_ the prediction and then perhaps
kill the write before it makes it to a data structure that is visible to
others, but internally from the CPU standpoint, yes, the write could have
happened before the read.
Now, whether that write is "before" or "after" the read is debatable. But
one way of looking at it is certainly that the write took place earlier,
and the read might have just caused it to be undone.
And there are real effects of this - looking at the bus, you might have a
bus transaction to get the cacheline that contains 'y' for exclusive
access happen _before_ the bus transaction that reads in the value of 'x'
(but you'd never see the writeout of that '5' before).
> And this write is visible to other CPUs, so that if x was initially 0
> and a second CPU sets x to 1, the second CPU may see y == 5 before it
> executes the write to x (whatever that may mean)?
Well, yes and no. CPU1 above won't release the '5' until it has confirmed
the '1' (even if it does so by reading it late). But assuming the other
CPU also does speculation, then yes, the situation you describe could
happen. If the other CPU does
z = y;
x = 1;
then it's certainly possible that 'z' contains 5 at the end (even if both
x and y started out zero). Because now the read of 'y' on that other CPU
might be delayed, and the write of 'x' goes ahead, CPU1 sees the 1, and
commits its write of 5, so when CPU2 gets the cacheline, z will now
contain 5.
Is it likely? No. CPU microarchitectures aim to do reads early, and writes
late. Reads are on the critical path, writes can be buffered. But you can
basically get into "impossible" situations where a write that was _later_
in the instruction stream than a read (on CPU2, the 'store 1 to x' would
be after the load of 'y' from memory) could show up in the other order on
another CPU.
Linus
On Tue, 8 Dec 2009, Alan Stern wrote:
>
> And likewise in try_wait_for_completion(). It looks like a bug. Maybe
> these routines were not intended to be called with interrupts disabled,
> but that requirement doesn't seem to be documented. And it isn't a
> natural requirement anyway.
'complete()' is supposed to be callable from interrupts, but the waiting
ones aren't. But 'complete()' is all you should need to call from
interrupts, so that's fine.
So I think completions should work, if done right. That whole "make the
parent wait for all the children to complete" is fine in that sense. And
I'll happily take such an approach if my rwlock thing doesn't work.
Linus
Do not set async_suspend for B and instead start your own async thread
from its suspend callback. The parent-children synchronization is done by the
core anyway (at least I'd do it that way), so the only thing you need to worry
about is the extra dependency.
> > That just complicates things. Compare to my simple locking scheme I've
> > quoted several times.
>
> It is a little more complicated in that it involves explicitly
> iterating over children. But it is simpler in that it can use
> completions instead of rwsems and it avoids the off-tree dependency
> problem described above.
I would be slightly more comfortable using completions, but the rwsem-based
approach is fine with me as well.
Rafael
On Tue, 8 Dec 2009, Alan Stern wrote:
>
> Suppose A and B are unrelated devices and we need to impose the
> off-tree constraint that A suspends after B.
Ah. Ok, I can imagine the off-tree constraints, but part of my "keep it
simple" was to simply not do them. If there are constraints that aren't
in the topology of the tree, then I simply don't think that async is worth
it in the first place.
> You misunderstand. The suspend algorithm will look like this:
>
> dpm_suspend()
> {
> list_for_each_entry_reverse(dpm_list, dev) {
> down_write(dev->lock);
> async_schedule(device_suspend, dev);
> }
> }
>
> device_suspend(dev)
> {
> device_for_each_child(dev, child) {
> down_read(child->lock);
> up_read(child->lock);
> }
> dev->suspend(dev); /* May do off-tree down+up pairs */
> up_write(dev->lock);
> }
Ok, so the above I think works (and see my previous email: I think
completions would be workable there too).
It's just that I think the "looping over children" is ugly, when I think
that by doing it the other way around you can make the code simpler and
only depend on the PM device list and a simple parent pointer access.
I also think that you are wrong that the above somehow protects against
non-topological dependencies. If the device you want to delay your own
suspend for is after you in the list, the down_read() on it may succeed
simply because it hasn't even done its down_write() yet and you got
scheduled early.
But I guess you could do that by walking the list twice (first to lock
them all, then to actually call the suspend function). That whole
two-phase thing, except the first phase _only_ locks, and doesn't do any
callbacks.
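In the same pseudocode style as earlier in the thread, the two-phase version might look like this (sketch; only the separate locking pass is new):

```c
dpm_suspend()
{
	/* Phase 1: only lock.  Every down_write() happens before any
	 * suspend task can run, so an off-tree down_read() can no
	 * longer succeed merely because its target device hasn't been
	 * reached in the list yet. */
	list_for_each_entry_reverse(dpm_list, dev)
		down_write(dev->lock);

	/* Phase 2: schedule the actual suspend callbacks. */
	list_for_each_entry_reverse(dpm_list, dev)
		async_schedule(device_suspend, dev);
}
```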
Linus
For completeness, below is the full async suspend/resume patch with rwsems,
which has been (very slightly) tested and doesn't seem to break things.
[Note to Alan: lockdep doesn't seem to complain about the unannotated nested
locks.]
Thanks,
Rafael
---
drivers/base/power/main.c | 195 +++++++++++++++++++++++++++++++++++++++----
include/linux/device.h | 6 +
include/linux/pm.h | 3
include/linux/resume-trace.h | 7 +
4 files changed, 194 insertions(+), 17 deletions(-)
@@ -334,25 +337,53 @@ static void pm_dev_err(struct device *de
* The driver of @dev will not receive interrupts while this function is being
* executed.
*/
-static int device_resume_noirq(struct device *dev, pm_message_t state)
+static int __device_resume_noirq(struct device *dev, pm_message_t state)
{
int error = 0;
TRACE_DEVICE(dev);
TRACE_RESUME(0);
- if (!dev->bus)
- goto End;
+ if (dev->parent)
+ down_read(&dev->parent->power.rwsem);
- if (dev->bus->pm) {
+ if (dev->bus && dev->bus->pm) {
pm_dev_dbg(dev, state, "EARLY ");
error = pm_noirq_op(dev, dev->bus->pm, state);
}
- End:
+
+ if (dev->parent)
@@ -366,32 +397,36 @@ void dpm_resume_noirq(pm_message_t state
+ if (dev->parent)
+ down_read(&dev->parent->power.rwsem);
down(&dev->sem);
if (dev->bus) {
@@ -426,11 +461,38 @@ static int device_resume(struct device *
}
End:
up(&dev->sem);
+ if (dev->parent)
@@ -444,6 +506,7 @@ static void dpm_resume(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.next);
@@ -454,7 +517,7 @@ static void dpm_resume(pm_message_t stat
dev->power.status = DPM_RESUMING;
mutex_unlock(&dpm_list_mtx);
- error = device_resume(dev, state);
+ error = device_resume(dev);
mutex_lock(&dpm_list_mtx);
if (error)
@@ -469,6 +532,7 @@ static void dpm_resume(pm_message_t stat
}
list_splice(&list, &dpm_list);
mutex_unlock(&dpm_list_mtx);
+ async_synchronize_full();
}
/**
@@ -533,6 +597,8 @@ static void dpm_complete(pm_message_t st
mutex_unlock(&dpm_list_mtx);
}
+static atomic_t async_error;
+
/**
* dpm_resume_end - Execute "resume" callbacks and complete system transition.
* @state: PM transition of the system being carried out.
@@ -580,20 +646,59 @@ static pm_message_t resume_event(pm_mess
* The driver of @dev will not receive interrupts while this function is being
* executed.
*/
-static int device_suspend_noirq(struct device *dev, pm_message_t state)
+static int __device_suspend_noirq(struct device *dev, pm_message_t state)
{
int error = 0;
- if (!dev->bus)
- return 0;
+ down_write(&dev->power.rwsem);
- if (dev->bus->pm) {
+ if (dev->bus && dev->bus->pm) {
pm_dev_dbg(dev, state, "LATE ");
error = pm_noirq_op(dev, dev->bus->pm, state);
}
+
+ up_write(&dev->power.rwsem);
+ if (dev->parent)
+ up_read(&dev->parent->power.rwsem);
+
return error;
}
+static void async_suspend_noirq(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error = atomic_read(&async_error);
+
+ if (error) {
+ if (dev->parent)
+ up_read(&dev->parent->power.rwsem);
+ dev->power.status = DPM_OFF;
+ return;
+ }
+
+ error = __device_suspend_noirq(dev, pm_transition);
+ if (error) {
+ pm_dev_err(dev, pm_transition, " async LATE", error);
+ dev->power.status = DPM_OFF;
+ atomic_set(&async_error, error);
+ }
+ put_device(dev);
+}
+
+static int device_suspend_noirq(struct device *dev)
+{
+ if (dev->parent)
+ down_read(&dev->parent->power.rwsem);
+
+ if (dev->power.async_suspend) {
+ get_device(dev);
+ async_schedule(async_suspend_noirq, dev);
+ return 0;
+ }
+
+ return __device_suspend_noirq(dev, pm_transition);
+}
+
/**
* dpm_suspend_noirq - Execute "late suspend" callbacks for non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -608,15 +713,21 @@ int dpm_suspend_noirq(pm_message_t state
suspend_device_irqs();
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
list_for_each_entry_reverse(dev, &dpm_list, power.entry) {
- error = device_suspend_noirq(dev, state);
+ dev->power.status = DPM_OFF_IRQ;
+ error = device_suspend_noirq(dev);
if (error) {
pm_dev_err(dev, state, " late", error);
+ dev->power.status = DPM_OFF;
break;
}
- dev->power.status = DPM_OFF_IRQ;
+ error = atomic_read(&async_error);
+ if (error)
+ break;
}
mutex_unlock(&dpm_list_mtx);
+ async_synchronize_full();
if (error)
dpm_resume_noirq(resume_event(state));
return error;
@@ -628,10 +739,11 @@ EXPORT_SYMBOL_GPL(dpm_suspend_noirq);
* @dev: Device to handle.
* @state: PM transition of the system being carried out.
*/
-static int device_suspend(struct device *dev, pm_message_t state)
+static int __device_suspend(struct device *dev, pm_message_t state)
{
int error = 0;
+ down_write(&dev->power.rwsem);
down(&dev->sem);
if (dev->class) {
@@ -668,10 +780,50 @@ static int device_suspend(struct device
}
End:
up(&dev->sem);
+ up_write(&dev->power.rwsem);
+ if (dev->parent)
+ up_read(&dev->parent->power.rwsem);
return error;
}
+static void async_suspend(void *data, async_cookie_t cookie)
+{
+ struct device *dev = (struct device *)data;
+ int error = atomic_read(&async_error);
+
+ if (error) {
+ if (dev->parent)
+ up_read(&dev->parent->power.rwsem);
+ dev->power.status = DPM_SUSPENDING;
+ goto End;
+ }
+
+ error = __device_suspend(dev, pm_transition);
+ if (error) {
+ pm_dev_err(dev, pm_transition, " async", error);
+ dev->power.status = DPM_SUSPENDING;
+ atomic_set(&async_error, error);
+ }
+
+ End:
+ put_device(dev);
+}
+
+static int device_suspend(struct device *dev, pm_message_t state)
+{
+ if (dev->parent)
+ down_read(&dev->parent->power.rwsem);
+
+ if (dev->power.async_suspend) {
+ get_device(dev);
+ async_schedule(async_suspend, dev);
+ return 0;
+ }
+
+ return __device_suspend(dev, pm_transition);
+}
+
/**
* dpm_suspend - Execute "suspend" callbacks for all non-sysdev devices.
* @state: PM transition of the system being carried out.
@@ -683,10 +835,12 @@ static int dpm_suspend(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
+ pm_transition = state;
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.prev);
get_device(dev);
+ dev->power.status = DPM_OFF;
mutex_unlock(&dpm_list_mtx);
error = device_suspend(dev, state);
@@ -694,16 +848,22 @@ static int dpm_suspend(pm_message_t stat
mutex_lock(&dpm_list_mtx);
if (error) {
pm_dev_err(dev, state, "", error);
+ dev->power.status = DPM_SUSPENDING;
put_device(dev);
break;
}
- dev->power.status = DPM_OFF;
if (!list_empty(&dev->power.entry))
list_move(&dev->power.entry, &list);
put_device(dev);
+ error = atomic_read(&async_error);
+ if (error)
+ break;
}
list_splice(&list, dpm_list.prev);
mutex_unlock(&dpm_list_mtx);
+ async_synchronize_full();
+ if (!error)
+ error = atomic_read(&async_error);
return error;
}
@@ -762,6 +922,7 @@ static int dpm_prepare(pm_message_t stat
INIT_LIST_HEAD(&list);
mutex_lock(&dpm_list_mtx);
transition_started = true;
+ atomic_set(&async_error, 0);
while (!list_empty(&dpm_list)) {
struct device *dev = to_device(dpm_list.next);
Index: linux-2.6/include/linux/resume-trace.h
===================================================================
--- linux-2.6.orig/include/linux/resume-trace.h
+++ linux-2.6/include/linux/resume-trace.h
@@ -6,6 +6,11 @@
extern int pm_trace_enabled;
+static inline int pm_trace_is_enabled(void)
+{
+ return pm_trace_enabled;
+}
+
struct device;
extern void set_trace_device(struct device *);
extern void generate_resume_trace(const void *tracedata, unsigned int user);
@@ -17,6 +22,8 @@ extern void generate_resume_trace(const
#else
+static inline int pm_trace_is_enabled(void) { return 0; }
+
#define TRACE_DEVICE(dev) do { } while (0)
#define TRACE_RESUME(dev) do { } while (0)
BTW, I can easily change it so that it uses completions for synchronization,
but I'm not sure if that's worth spending time on, so please let me know.
Rafael