Back to the future.

Nigel Cunningham

Apr 26, 2007, 2:10:10 AM

Hi again.

So - trying to get back to the original discussion - what (if anything)
do you see as the way ahead?

The options I can think of are (starting with things I can do):

1) I stop developing Suspend2, thereby pushing however many current
Suspend2 users to move to [u]swsusp and seek to get that up to speed.

2) I quit my day job, see if Redhat will take me full time and give me
the time to start trying to merge Suspend2 bit by bit. Alternatively,
days suddenly become 8 hours longer and I discover the boundless energy
and alertness needed to do this too :). Ok. Not going to happen.

3) Someone else steps up to the plate and tries to merge Suspend2 one
bit at a time.

4) uswsusp and swsusp get dropped and Suspend2 goes into mainline.

5) Everything gets dropped and we start from scratch.

6) The status quo - or some small variant of it - stays. Oh... I said
"way ahead". I guess that rules this one out, even though I'll be very
surprised if it's not the one that wins out.

7) Suspend2 gets merged and people get to choose which they like better.
Nearly forgot this as a conceivable possibility. Yeah, I know you said
you don't want it. I'm just trying to think of what might possibly
happen.

N.

Pekka Enberg

Apr 26, 2007, 3:40:06 AM

On 4/26/07, Nigel Cunningham <ni...@nigel.suspend2.net> wrote:
> 3) Someone else steps up to the plate and tries to merge Suspend2 one
> bit at a time.

So which bits do we want to merge? For example, Suspend2
kernel/power/ui.c, kernel/power/compression.c, and
kernel/power/encryption.c seem pointless now that we have uswsusp.
Furthermore, being the shameless Linus cheerleader that I am, I got
the impression that we should fix the snapshot/shutdown logic in the
kernel which Suspend2 doesn't really address?

Pekka

Nigel Cunningham

Apr 26, 2007, 3:50:08 AM

Hi.

On Thu, 2007-04-26 at 10:28 +0300, Pekka Enberg wrote:
> On 4/26/07, Nigel Cunningham <ni...@nigel.suspend2.net> wrote:
> > 3) Someone else steps up to the plate and tries to merge Suspend2 one
> > bit at a time.
>
> So which bits do we want to merge? For example, Suspend2
> kernel/power/ui.c, kernel/power/compression.c, and
> kernel/power/encryption.c seem pointless now that we have uswsusp.
> Furthermore, being the shameless Linus cheerleader that I am, I got
> the impression that we should fix the snapshot/shutdown logic in the
> kernel which Suspend2 doesn't really address?

I agree that the driver logic could be addressed too, but to answer your
question...

* Doing things in the right order? (Prepare the image, then do the
atomic copy, then save).
* Multithreaded I/O (might as well use multiple cores to compress the
image, now that we're hotplugging later).
* Support for > 1 swap device.
* Support for ordinary files.
* Full image option.
* Modular design?

Regards,

Nigel

Pekka Enberg

Apr 26, 2007, 4:30:09 AM

Hi Nigel,

On 4/26/07, Nigel Cunningham <ni...@nigel.suspend2.net> wrote:
> * Doing things in the right order? (Prepare the image, then do the
> atomic copy, then save).

As I am a total newbie to the power management code, I am unable to
spot the conceptual difference in uswsusp suspend.c:suspend_system()
and suspend2 kernel/power/suspend.c:suspend_main(). How are they
different?

On 4/26/07, Nigel Cunningham <ni...@nigel.suspend2.net> wrote:
> * Multithreaded I/O (might as well use multiple cores to compress the
> image, now that we're hotplugging later).

I assume this doesn't affect the kernel at all with uswsusp?

On 4/26/07, Nigel Cunningham <ni...@nigel.suspend2.net> wrote:

> * Modular design?

This is too broad. Please be more specific about the problems the current
suspend and snapshot/shutdown code in the kernel has.

Now to add to your list, as far as I can tell, suspend2 provides
better feedback to the user via the netlink mechanism (although the
kernel shouldn't be sending messages such as userui_redraw but instead
let the userspace know of the actual events, for example, that tasks
have now been frozen). However, I am unsure if this is still relevant
as most of the work (snapshot writing) is being done in userspace
where we explicitly know when processes have been frozen, when the
snapshot is finished, and when it's written to disk.

Jan Engelhardt

Apr 26, 2007, 4:50:12 AM

On Apr 26 2007 16:04, Nigel Cunningham wrote:
>
>Hi again.
>
>So - trying to get back to the original discussion - what (if anything)
>do you see as the way ahead?
>
>The options I can think of are (starting with things I can do):
>

>1) [...]
>2) [...]
>3) [...]
>4) [...]
>5) [...]
>6) [...]
>7) [...]

Perhaps do it the EVMS way? Do as much in userspace as possible, and
try to have a simple kernel API at the same time.
Perhaps (3) would be it, but ask Redhat _first_ before quitting anything :)

Jan
--

Nigel Cunningham

Apr 26, 2007, 9:52:44 AM

Hi.

On Thu, 2007-04-26 at 11:17 +0300, Pekka Enberg wrote:
> Hi Nigel,
>
> On 4/26/07, Nigel Cunningham <ni...@nigel.suspend2.net> wrote:
> > * Doing things in the right order? (Prepare the image, then do the
> > atomic copy, then save).
>
> As I am a total newbie to the power management code, I am unable to
> spot the conceptual difference in uswsusp suspend.c:suspend_system()
> and suspend2 kernel/power/suspend.c:suspend_main(). How are they
> different?

Will discuss in irc since you've appeared there...

> On 4/26/07, Nigel Cunningham <ni...@nigel.suspend2.net> wrote:
> > * Multithreaded I/O (might as well use multiple cores to compress the
> > image, now that we're hotplugging later).
>
> I assume this doesn't affect the kernel at all with uswsusp?

Well uswsusp would benefit from using multiple threads - if it can - to
do the work. I saw quite an improvement from implementing it.

> On 4/26/07, Nigel Cunningham <ni...@nigel.suspend2.net> wrote:
> > * Modular design?
>
> This is too broad. Please be more specific about the problems the current
> suspend and snapshot/shutdown code in the kernel has.

Did you see the 'Reasons to merge' email I sent? It has more detail on
this.

> Now to add to your list, as far as I can tell, suspend2 provides
> better feedback to the user via the netlink mechanism (although the
> kernel shouldn't be sending messages such as userui_redraw but instead
> let the userspace know of the actual events, for example, that tasks
> have now been frozen). However, I am unsure if this is still relevant
> as most of the work (snapshot writing) is being done in userspace
> where we explicitly know when processes have been frozen, when the
> snapshot is finished, and when it's written to disk.

From uswsusp's point of view, yeah. But I'm still coming from the 'doing
this in kernelspace makes far more sense' perspective.

Regards,

Nigel

Nigel Cunningham

Apr 26, 2007, 9:57:53 AM

Hi.

On Thu, 2007-04-26 at 10:38 +0200, Jan Engelhardt wrote:
> On Apr 26 2007 16:04, Nigel Cunningham wrote:
> >
> >Hi again.
> >
> >So - trying to get back to the original discussion - what (if anything)
> >do you see as the way ahead?
> >
> >The options I can think of are (starting with things I can do):
> >
> >1) [...]
> >2) [...]
> >3) [...]
> >4) [...]
> >5) [...]
> >6) [...]
> >7) [...]
>
> Perhaps do it the EVMS way? Do as much in userspace as possible, and
> try to have a simple kernel API at the same time.
> Perhaps (3) would be it, but ask Redhat _first_ before quitting anything :)

:) Well, the EVMS way is swsusp. Personally, I agree with Linus and
think putting suspend to disk code in userspace is just a broken idea.

Regards,

Nigel

Linus Torvalds

Apr 26, 2007, 1:00:10 PM

On Thu, 26 Apr 2007, Nigel Cunningham wrote:
>
> * Doing things in the right order? (Prepare the image, then do the
> atomic copy, then save).

I'd actually like to discuss this a bit..

I'm obviously not a huge fan of the whole user/kernel level split and
interfaces, but I actually do think that there is *one* split that makes
sense:

- generate the (whole) snapshot image entirely inside the kernel

- do nothing else (ie no IO at all), and just export it as a single image
to user space (literally just mapping the pages into user space).
*one* interface. None of the "pretty UI update" crap. Just a single
system call:

    void *snapshot_system(u32 *size);

which will map in the snapshot, return the mapped address and the size
(and if you want to support snapshots > 4GB, be my guest, but I suspect
you're actually *better* off just admitting that if you cannot shrink
the snapshot to less than 32 bits, it's not worth doing)

User space gets a fully running system, with that one process having that
one image mapped into its address space. It can then compress/write/do
whatever to that snapshot.

You need one other system call, of course, which is

    int resume_snapshot(void *snapshot, u32 size);

and for testing, you should be able to basically do

    u32 size;
    void *buffer = snapshot_system(&size);
    if (buffer != MAP_FAILED)
        resume_snapshot(buffer, size);

and it should obviously work.

And btw, the device model changes are a big part of this. Because I don't
think it's even remotely debuggable with the full suspend/resume of the
devices being part of generating the image! That freeze/snapshot/unfreeze
sequence is likely a lot more debuggable, if only because freeze/unfreeze
is actually a no-op for most devices, and snapshotting is trivial too.

Once you have that snapshot image in user space you can do anything you
want. And again: you'd have a fully working system: not any degradation
*at*all*. If you're in X, then X will continue running etc even after the
snapshotting, although obviously the snapshotting will have tried to page
a lot of stuff out in order to make the snapshot smaller, so you'll likely
be crawling.

> * Multithreaded I/O (might as well use multiple cores to compress the
> image, now that we're hotplugging later).
> * Support for > 1 swap device.
> * Support for ordinary files.
> * Full image option.
> * Modular design?

I'd really suggest _just_ the "full image". Nothing else is probably ever
worth supporting. Your "snapshot to disk" wouldn't be _quite_ as simple as
"echo disk > /sys/power/state", but it should not necessarily be much
worse than

    snapshot_kernel | gzip -9 > /dev/snapshot

either (and resuming from the snapshot would just be the reverse)!

And if you want to send the snapshot over a TCP connection to another
host, be my guest. With pretty images while it's transferring. Whatever.

Linus

Xavier Bestel

Apr 26, 2007, 1:10:07 PM

On Thu, 2007-04-26 at 09:56 -0700, Linus Torvalds wrote:
> Once you have that snapshot image in user space you can do anything you
> want. And again: you'd have a fully working system: not any degradation
> *at*all*. If you're in X, then X will continue running etc even after the
> snapshotting

Won't there be problems if e.g. X tries to write something to its
logfile after snapshot ?

Xav

Linus Torvalds

Apr 26, 2007, 1:20:14 PM

On Thu, 26 Apr 2007, Linus Torvalds wrote:
>
> Once you have that snapshot image in user space you can do anything you
> want.

Side note: the exception, of course, is page out more. The swap device has
to be read-only.

We actually have support for that mode (it's how "swapoff" works: it marks
swap devices as not accepting _new_ entries, even though old entries are
still valid). So you can have a fully running system, with 99% of memory
swapped out, and still guarantee that you won't swap out anything *more*
(which would destroy the swap image, which you don't want, since it's
where a lot of the memory may end up being, in order to make the snapshot
itself as small as possible)!

Anybody who cares can look at the code that messes with the
SWP_WRITEOK flag. You'd basically swap out enough to make the snapshot
image fit comfortably in memory, and then you'd clear SWP_WRITEOK on all
swap devices and return to user space. Or something very close to that.
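
[A toy model of that idea, not kernel code. The flag names SWP_USED and SWP_WRITEOK follow the kernel's, but the values, struct toy_swap_dev and both functions are assumptions made up for illustration. Once the write-OK bit is cleared, existing entries remain valid and no new swap-out can get a slot.]

```c
#include <stddef.h>

/* Flag names follow the kernel's; the values and struct are a toy model. */
#define SWP_USED	(1u << 0)
#define SWP_WRITEOK	(1u << 1)

struct toy_swap_dev {
	unsigned int flags;
	unsigned int free_slots;
};

/* Clear SWP_WRITEOK everywhere: old entries stay readable, no new ones. */
void toy_freeze_swap(struct toy_swap_dev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		devs[i].flags &= ~SWP_WRITEOK;
}

/* Try to take a slot for a new swap-out; returns device index or -1. */
int toy_alloc_swap_slot(struct toy_swap_dev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		unsigned int want = SWP_USED | SWP_WRITEOK;

		if ((devs[i].flags & want) == want && devs[i].free_slots > 0) {
			devs[i].free_slots--;
			return (int)i;
		}
	}
	return -1;
}
```

[After toy_freeze_swap(), toy_alloc_swap_slot() returns -1 for every device while SWP_USED stays set, which is the "read-only swap" state described above.]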

But the point here is that we should actually really be able to have a
fully working system, even _after_ we created the snapshot. I don't even
think you should need any "initrd only" kind of situation.

If somebody can do that, with just those two system calls, I'll remove
every other suspend-to-disk wannabe from the kernel in a heartbeat. I may
have missed something subtle, of course, but I really *think* it should be
doable.

Linus Torvalds

Apr 26, 2007, 1:40:11 PM

On Thu, 26 Apr 2007, Xavier Bestel wrote:
>
> Won't there be problems if e.g. X tries to write something to its
> logfile after snapshot ?

Sure. But that's a user-level issue.

You do have to allow writing after snapshotting, since at a minimum, you'd
want the snapshot itself to be written. So the kernel has to be fully
running, and support full user space. No "degraded mode" like now.

So when I said "fully running user mode", I really meant it from the
perspective of the kernel - not necessarily from the perspective of the
"user". You do want to limit _what_ user mode does, but you must not limit
it by making the kernel less capable.

Remounting mounted filesystems read-only sounds like a good idea, for
example. We can do that. We have the technology. But we shouldn't limit
user space from doing other things (for example, it might want to actually
*mount* a new filesystem for writing the snapshot image).

For example, right now we try to "fix" that with the whole process freezer
thing. And that code has *caused* more problems than it fixed, since it
tries to freeze all the kernel threads etc, and you simply don't have a
truly *working*system*.

I think it's fine to freeze processes if that is what you want to do (for
example, send them SIGSTOP), but we freeze them *too*much* right now, and
the suspend stuff has taken over policy way too much. We don't actually
leave the system in a runnable state. I can almost guarantee that you'd be
*better* off having the snapshot taking thing do a

    kill(-1, SIGSTOP);

in user space than our current broken process freezer. At least that
wouldn't have screwed up those kernel threads as badly as swsusp did.
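
[A small sketch of the user-space stop/continue mechanism, assuming a POSIX system. It signals a single forked child instead of using kill(-1, ...), so it is safe to run. stop_and_release_child() is a made-up name for this demo, not part of any suspend tool.]

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Stop one child with SIGSTOP, confirm it stopped, then let it run again.
 * Returns 0 if the child was observed stopped and later terminated by our
 * SIGTERM, -1 on any failure.
 */
int stop_and_release_child(void)
{
	int status;
	pid_t child = fork();

	if (child < 0)
		return -1;
	if (child == 0) {
		for (;;)
			pause();	/* idle until a signal arrives */
	}

	/* "Freeze" from user space: SIGSTOP cannot be caught or ignored. */
	if (kill(child, SIGSTOP) < 0)
		goto fail;
	/* WUNTRACED makes waitpid() report the stop, not just an exit. */
	if (waitpid(child, &status, WUNTRACED) != child || !WIFSTOPPED(status))
		goto fail;

	/* ... a snapshot tool would do its work here ... */

	/* "Thaw", then ask the child to exit so the demo cleans up. */
	if (kill(child, SIGCONT) < 0 || kill(child, SIGTERM) < 0)
		goto fail;
	if (waitpid(child, &status, 0) != child)
		return -1;
	return (WIFSIGNALED(status) && WTERMSIG(status) == SIGTERM) ? 0 : -1;

fail:
	kill(child, SIGKILL);
	waitpid(child, &status, 0);
	return -1;
}
```

[Only user processes can be stopped this way; kernel threads are untouched, which is the contrast with the in-kernel freezer drawn above.]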

And no, I'm not saying that my suggestion is the only way to do it. Go
wild. But the *current* situation is just broken. Three different things,
none of which people can agree on. I'd *much* rather see a conceptually
simpler approach that then required, but even more important is that right
now people aren't even discussing alternatives, they're just pushing one
of the three existing things, and that's simply not viable. Because I'm
not merging another one.

In fact, I personally feel that I shouldn't even have merged
userspace-swsusp, but if Andrew thinks it needs to be merged, my personal
feelings simply don't matter that much. I have to trust people. But yes,
as far as *I* am personally concerned, I think it was a mistake to merge
it.

Linus

Luca Tettamanti

Apr 26, 2007, 1:40:17 PM

Nigel Cunningham <ni...@nigel.suspend2.net> wrote:

> On Thu, 2007-04-26 at 11:17 +0300, Pekka Enberg wrote:
>> On 4/26/07, Nigel Cunningham <ni...@nigel.suspend2.net> wrote:
>> > * Multithreaded I/O (might as well use multiple cores to compress the
>> > image, now that we're hotplugging later).
>>
>> I assume this doesn't affect the kernel at all with uswsusp?
>
> Well uswsusp would benefit from using multiple threads - if it can - to
> do the work. I saw quite an improvement from implementing it.

It's doable[1], but I'm not sure that the added complexity is worth it.
I'm surprised that you see a big improvement. I'd expect that the image
write is bottlenecked by the disk performance. On my PC (Core2, locked
at 1.6GHz) lzf can compress 250-280MB/s; even with an older CPU that can
do 1/3 it's still more than the disk can handle.
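
[A back-of-the-envelope model of the bottleneck question, under stated assumptions and not a measurement. It assumes compression and disk writes overlap and that the disk only has to write the compressed bytes. image_write_rate() and its parameters are hypothetical names, and any disk speed or compression ratio fed into it is an illustrative guess.]

```c
/*
 * Rates are in MB/s of *uncompressed* image data; ratio is
 * compressed size / original size (0 < ratio <= 1).
 * With overlapping stages, the slower stage sets the pace.
 */
double image_write_rate(double compress_mb_s, double disk_mb_s, double ratio)
{
	double disk_limit = disk_mb_s / ratio;	/* disk only sees compressed bytes */

	return compress_mb_s < disk_limit ? compress_mb_s : disk_limit;
}
```

[With a made-up 60 MB/s disk and a 0.5 ratio, a 250 MB/s compressor gives 120 MB/s and is disk-bound, so extra threads would not help. A compressor at 80 MB/s would itself be the limit, and that is the only case in this model where more cores could pay off.]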

Luca
[1] We may even use MPI to compress over a Beowulf cluster, it's
userspace ;)
--
"Always remember that you are unique, exactly like everyone else."

Chase Venters

Apr 26, 2007, 3:20:12 PM

On Thu, 26 Apr 2007, Linus Torvalds wrote:

>
> Once you have that snapshot image in user space you can do anything you
> want. And again: you'd have a fully working system: not any degradation
> *at*all*. If you're in X, then X will continue running etc even after the
> snapshotting, although obviously the snapshotting will have tried to page
> a lot of stuff out in order to make the snapshot smaller, so you'll likely
> be crawling.
>

In fact... If you're just paging out to make a smaller snapshot (ie, not
to free up memory), couldn't you just swap it out (if it's not backed by a
file) then mark it as "half-released"... ie, the snapshot writing code
ignores it knowing that it will be available on disk at resume, but then
when the snapshot is complete it's still available in physical RAM,
preventing user-space from crawling due to the necessity of paging it all
back in?

Thanks,
Chase

David Lang

Apr 26, 2007, 3:30:14 PM

On Thu, 26 Apr 2007, Chase Venters wrote:

> On Thu, 26 Apr 2007, Linus Torvalds wrote:
>
>>
>> Once you have that snapshot image in user space you can do anything you
>> want. And again: you'd have a fully working system: not any degradation
>> *at*all*. If you're in X, then X will continue running etc even after the
>> snapshotting, although obviously the snapshotting will have tried to page
>> a lot of stuff out in order to make the snapshot smaller, so you'll likely
>> be crawling.
>>
>
> In fact... If you're just paging out to make a smaller snapshot (ie, not
> to free up memory), couldn't you just swap it out (if it's not backed by a
> file) then mark it as "half-released"... ie, the snapshot writing code
> ignores it knowing that it will be available on disk at resume, but then
> when the snapshot is complete it's still available in physical RAM,
> preventing user-space from crawling due to the necessity of paging it all
> back in?

your swap space may end up being re-used before you restore with std

David Lang

Nigel Cunningham

Apr 26, 2007, 4:00:14 PM

Hi.

On Thu, 2007-04-26 at 09:56 -0700, Linus Torvalds wrote:
>
> On Thu, 26 Apr 2007, Nigel Cunningham wrote:
> >
> > * Doing things in the right order? (Prepare the image, then do the
> > atomic copy, then save).
>
> I'd actually like to discuss this a bit..
>
> I'm obviously not a huge fan of the whole user/kernel level split and
> interfaces, but I actually do think that there is *one* split that makes
> sense:
>
> - generate the (whole) snapshot image entirely inside the kernel
>
> - do nothing else (ie no IO at all), and just export it as a single image
> to user space (literally just mapping the pages into user space).
> *one* interface. None of the "pretty UI update" crap. Just a single
> system call:
>
> void *snapshot_system(u32 *size);
>
> which will map in the snapshot, return the mapped address and the size
> (and if you want to support snapshots > 4GB, be my guest, but I suspect
> you're actually *better* off just admitting that if you cannot shrink
> the snapshot to less than 32 bits, it's not worth doing)

That inherently limits the image to half of available ram (you need
somewhere to store the snapshot), so you won't get the full image you
express interest in below.

> User space gets a fully running system, with that one process having that
> one image mapped into its address space. It can then compress/write/do
> whatever to that snapshot.

You're describing uswsusp! (At least in so far as I understand it!).

You can't get a fully running system though, because if anything changes
something on disk that was snapshotted (super blocks etc) your snapshot
is invalid and you risk on-disk corruption.

> And btw, the device model changes are a big part of this. Because I don't
> think it's even remotely debuggable with the full suspend/resume of the
> devices being part of generating the image! That freeze/snapshot/unfreeze
> sequence is likely a lot more debuggable, if only because freeze/unfreeze
> is actually a no-op for most devices, and snapshotting is trivial too.
>
> Once you have that snapshot image in user space you can do anything you
> want. And again: you'd have a fully working system: not any degradation
> *at*all*. If you're in X, then X will continue running etc even after the
> snapshotting, although obviously the snapshotting will have tried to page
> a lot of stuff out in order to make the snapshot smaller, so you'll likely
> be crawling.

Nooooooo! See above about disk corruption.

> > * Multithreaded I/O (might as well use multiple cores to compress the
> > image, now that we're hotplugging later).
> > * Support for > 1 swap device.
> > * Support for ordinary files.
> > * Full image option.
> > * Modular design?
>
> I'd really suggest _just_ the "full image". Nothing else is probably ever
> worth supporting. Your "snapshot to disk" wouldn't be _quite_ as simple as
> "echo disk > /sys/power/state", but it should not necessarily be much
> worse than

Please, go apply that logic elsewhere, then cut out (or at least stop
adding) support for users with less common needs in other areas. I fully
acknowledge that most users have only one place to store their image and
it's a swap device. But that doesn't mean one size fits all.

A full image implies that you need to figure out what's not going to
change while you're writing it and save that separately. At the moment,
I'm treating most of the LRU contents as that list. If we're going to
start trying to let every man and his dog run while we're trying to
snapshot the system, that's not going to work anymore - or the logic
will get a lot more complicated.

Sorry. I never thought I'd say this, but I think you're being naive
about how simple the process of snapshotting a system is.

Regards,

Nigel

Nigel Cunningham

Apr 26, 2007, 4:10:13 PM

Hi.

On Thu, 2007-04-26 at 10:34 -0700, Linus Torvalds wrote:
>
> On Thu, 26 Apr 2007, Xavier Bestel wrote:
> >
> > Won't there be problems if e.g. X tries to write something to its
> > logfile after snapshot ?
>
> Sure. But that's a user-level issue.
>
> You do have to allow writing after snapshotting, since at a minimum, you'd
> want the snapshot itself to be written. So the kernel has to be fully
> running, and support full user space. No "degraded mode" like now.

It doesn't need a fully functional userspace (unless you want to write
to a fuse device, and even then that could be worked around - make it
like uswsusp or userui).... can I diverge for a second and say that from
this point of view, fuse is the lamest idea ever invented. Guaranteed to
break your ability to suspend^Wsnapshot.... Anyhow, if the kernel has
bmapped the pages it's going to write to beforehand, it knows where the
image needs to go. No need for userspace at all.

> So when I said "fully running user mode", I really meant it from the
> perspective of the kernel - not necessarily from the perspective of the
> "user". You do want to limit _what_ user mode does, but you must not limit
> it by making the kernel less capable.
>
> Remounting mounted filesystems read-only sounds like a good idea, for
> example. We can do that. We have the technology. But we shouldn't limit
> user space from doing other things (for example, it might want to actually
> *mount* a new filesystem for writing the snapshot image).

We tried that. It would need some work. IIRC remounting filesystems
read-only makes files become marked read-only. Perfectly sensible,
except that if you then remount the filesystem rw at resume time, all
those files are still marked ro and userspace crashes and burns. Not
unfixable, I'll agree, but there is more work to do there.

As to the example, mounting a new filesystem for writing the snapshot
image should probably be done before we do the snapshot. Then it won't
be in danger of triggering anything that might require one of the other
fses to be rw (eg syslog).

> For example, right now we try to "fix" that with the whole process freezer
> thing. And that code has *caused* more problems than it fixed, since it
> tries to freeze all the kernel threads etc, and you simply don't have a
> truly *working*system*.

Yes, it has been difficult. But so is bringing up a child.

> I think it's fine to freeze processes if that is what you want to do (for
> example, send them SIGSTOP), but we freeze them *too*much* right now, and
> the suspend stuff has taken over policy way too much. We don't actually
> leave the system in a runnable state. I can almost guarantee that you'd be
> *better* off having the snapshot taking thing do a
>
> kill(-1, SIGSTOP);
>
> in user space than our current broken process freezer. At least that
> wouldn't have screwed up those kernel threads as badly as swsusp did.

I don't think it's fair to blame swsusp there. Maybe cpu hotplugging...

> And no, I'm not saying that my suggestion is the only way to do it. Go
> wild. But the *current* situation is just broken. Three different things,
> none of which people can agree on. I'd *much* rather see a conceptually
> simpler approach that then required, but even more important is that right
> now people aren't even discussing alternatives, they're just pushing one
> of the three existing things, and that's simply not viable. Because I'm
> not merging another one.
>
> In fact, I personally feel that I shouldn't even have merged
> userspace-swsusp, but if Andrew thinks it needs to be merged, my personal
> feelings simply don't matter that much. I have to trust people. But yes,
> as far as *I* am personally concerned, I think it was a mistake to merge
> it.

Perhaps you should try to make an alternative yourself instead of
pushing us into making something we don't believe will work (my case) or
have already done but in a way you don't like (Rafael). Don't talk about
Pavel cutting code. He's just acking/nacking what Rafael sends him.

Nigel

Linus Torvalds

Apr 26, 2007, 4:50:16 PM

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
>
> Perhaps you should try to make an alternative yourself instead of
> pushing us into making something we don't believe will work (my case) or
> have already done but in a way you don't like (Rafael). Don't talk about
> Pavel cutting code. He's just acking/nacking what Rafael sends him.

I've done that in the past (USB, PCMCIA - screw the maintainers, redo
it basically from scratch). But the thing is, I'm totally uninterested
personally in the whole disk-snapshotting, so I'm not likely to do it
there.

But yes, I'm actually hoping that some new person will come in with a new
idea. The current people seem to be too set in "their" corners, and I
don't expect that to really change.

Quite honestly, I don't foresee any of the current three approaches really
doing something new and obviously better, unless somebody new steps in.

Nigel Cunningham

Apr 26, 2007, 5:00:09 PM

Hi.

On Thu, 2007-04-26 at 13:45 -0700, Linus Torvalds wrote:
>
> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> >
> > Perhaps you should try to make an alternative yourself instead of
> > pushing us into making something we don't believe will work (my case) or
> > have already done but in a way you don't like (Rafael). Don't talk about
> > Pavel cutting code. He's just acking/nacking what Rafael sends him.
>
> I've done that in the past (USB, PCMCIA - screw the maintainers, redo
> it basically from scratch). But the thing is, I'm totally uninterested
> personally in the whole disk-snapshotting, so I'm not likely to do it
> there.
>
> But yes, I'm actually hoping that some new person will come in with a new
> idea. The current people seem to be too set in "their" corners, and I
> don't expect that to really change.
>
> Quite honestly, I don't foresee any of the current three approaches really
> doing something new and obviously better, unless somebody new steps in.

That's because there is no other possibility. Sooner or later you have
to do a snapshot, and somehow you have to save it. You're not going to
get a new solution, just one that does those basic things in new and
better ways.

I'm perfectly willing to think through some alternate approach if you
suggest something or prod my thinking in a new direction, but I'm afraid
I just can't see right now how we can achieve what you're after.

Nigel

Theodore Tso

Apr 26, 2007, 5:40:40 PM

On Fri, Apr 27, 2007 at 06:08:01AM +1000, Nigel Cunningham wrote:
> We tried that. It would need some work. IIRC remounting filesystems
> read-only makes files become marked read-only. Perfectly sensible,
> except that if you then remount the filesystem rw at resume time, all
> those files are still marked ro and userspace crashes and burns. Not
> unfixable, I'll agree, but there is more work to do there.

There are other solutions, though. One is that we could export a
system call interface which freezes a filesystem and prevents any
further I/O. We mostly have something like that right now (via the
write_super_lockfs function in the superblock operations
structure), but we haven't exported it to userspace. And right now
not all filesystems support it, but in theory that could be fixed (or
you only support suspend/resume if all filesystems support lockfs).

We would also need a similar interface to freeze any block device I/O,
in case you have a database running and doing direct I/O to a block
device. (Or again, we could simply not support that case; how many
people will be running a database accessing a block device on
their laptop?)

So in order to do this right, we would have to double the number of
new interfaces needed from the two proposed by Linus --- which is why
I think the userspace suspend solution is fundamentally NOT the right
one. Rather the right one is the one which Linux ultimately used for
PCMCIA, which is to do it all in the kernel.

- Ted

Rafael J. Wysocki

Apr 26, 2007, 6:10:34 PM

On Thursday, 26 April 2007 22:08, Nigel Cunningham wrote:
[--snip--]

> > And no, I'm not saying that my suggestion is the only way to do it. Go
> > wild. But the *current* situation is just broken. Three different things,
> > none of which people can agree on. I'd *much* rather see a conceptually
> > simpler approach that then required, but even more important is that right
> > now people aren't even discussing alternatives, they're just pushing one
> > of the three existing things, and that's simply not viable. Because I'm
> > not merging another one.
> >
> > In fact, I personally feel that I shouldn't even have merged
> > userspace-swsusp, but if Andrew thinks it needs to be merged, my personal
> > feelings simply don't matter that much. I have to trust people. But yes,
> > as far as *I* am personally concerned, I think it was a mistake to merge
> > it.
>
> Perhaps you should try to make an alternative yourself instead of
> pushing us into making something we don't believe will work (my case) or
> have already done but in a way you don't like (Rafael). Don't talk about
> Pavel cutting code. He's just acking/nacking what Rafael sends him.

Well, I think that much of what Linus is saying indicates that he hasn't tried
to write any such thing himself. ;-)

Anyway, I'm tired of this whole thing. Really. I've just been trying to make
things _work_ more-or-less reliably in a way that Pavel liked and I really
didn't know that much about the kernel when I started. In fact, I started as a
user who needed certain functionality from the kernel and that was not there
at the time. I've made some mistakes because of that (like the definitions of
the ioctl numbers in suspend.h - this was just a rookie mistake, and I'm
ashamed of it, but _nobody_ caught it, although I believe many people were
looking at the patch).

Now that I know much more than before, I can say I agree with Linus on his
opinion about the separation of s2ram from the snapshot/restore functionality
(I'll call it 'hibernation' for simplicity from now on). It should be done,
because it would make things simpler and cleaner. Still, it will be difficult
to do without screwing users en masse and that's my main concern here.

I don't agree that we don't need the tasks freezer for suspending and
hibernation. We need it, because we need to be sure that the (other) tasks
will not get in our way, and that also applies to kernel threads (and I
don't think the tasks freezer is 'screwing' them, BTW).

I agree that the userland interface for swsusp is not very nice and I'm going
to do my best to clean that up. I hope that someone will help me, but if not,
then that's fine. OTOH, it's difficult, if not impossible, to do a
userland-driven hibernation in a completely clean way. I've tried that and I'm
not exactly satisfied with the result, although it works and some distros use
it. I wouldn't do it again, but then I'm going to support the existing
users, as I promised.

Now, I think that the hibernation should better be done completely in the
kernel, because that's just conceptually simpler, although some data exchange
with userland may be acceptable for some optional fancy stuff. I'm also
tired of the endless "to merge or not to merge suspend2" discussions that just
lead to nowhere. For these reasons I declare that I'm ready to cooperate with
Nigel on integrating as much of suspend2 as reasonably possible into the
existing infrastructure, under the following conditions:
- we don't remove the existing user-visible interfaces
- we work on one piece of code at a time
- we avoid code duplication, as much as possible
- we avoid using open-coded things, if possible
- if we don't agree on something, we ask someone wiser (volunteers welcome ;-))

If that's acceptable, we can start tomorrow. In the process, we can try to
separate the hibernation code paths from the s2ram ones, but that will require
a lot of knowledge about things that neither me nor Nigel, AFAICT, are very
familiar with, like writing device drivers.

Greetings,
Rafael

Nigel Cunningham

Apr 26, 2007, 6:30:16 PM

Hi Rafael.

I don't want to remove user visible interfaces either (I understand that
you mean the ioctls by that?). Perhaps we can find a way to make them
still usable with a more in-kernel solution (ie some things become
noops?).

> - we work on one piece of code at a time

Sure. We should spend some time discussing and planning beforehand so we
don't waste time and effort writing and rewriting.

> - we avoid code duplication, as much as possible

No problem there.

> - we avoid using open-coded things, if possible

Regarding open-coded things, I assume you're referring to the extents. I
would argue that they're not open-coded because list.h implements doubly
linked lists, and extents use a singly linked list. That said, I suppose
we could make the extents doubly linked and use list.h, even though that
would be a waste of 4/8 bytes per extent.
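
The size trade-off in that paragraph can be sketched in plain C. This is only an illustration: the struct names, the list_node stand-in for the kernel's struct list_head, and extent_append() are hypothetical and are not taken from the Suspend2 source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical extent: a run of consecutive values [start, end]. */
struct sl_extent {
	unsigned long start, end;
	struct sl_extent *next;		/* singly linked: one pointer */
};

/* Userspace stand-in for the kernel's struct list_head (next + prev). */
struct list_node {
	struct list_node *next, *prev;
};

struct dl_extent {
	unsigned long start, end;
	struct list_node node;		/* doubly linked: two pointers */
};

/*
 * Append value to a chain, growing the tail extent when the value is
 * contiguous with it. Returns the (possibly new) tail, or NULL if
 * allocation fails.
 */
struct sl_extent *extent_append(struct sl_extent **head,
				struct sl_extent *tail,
				unsigned long value)
{
	struct sl_extent *e;

	assert(head != NULL);
	if (tail && tail->end + 1 == value) {
		tail->end = value;
		return tail;
	}
	e = malloc(sizeof(*e));
	if (!e)
		return NULL;
	e->start = e->end = value;
	e->next = NULL;
	if (tail)
		tail->next = e;
	else
		*head = e;
	return e;
}
```

On common ABIs the doubly linked variant is exactly one pointer larger per extent, which is the 4 bytes (32-bit) or 8 bytes (64-bit) mentioned above.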

> - if we don't agree on something, we ask someone wiser (volunteers welcome ;-))

Absolutely!

> If that's acceptable, we can start tomorrow. In the process, we can try to
> separate the hibernation code paths from the s2ram ones, but that will require
> a lot of knowledge about things that neither me nor Nigel, AFAICT, are very
> familiar with, like writing device drivers.

Yes.

Thanks for this email. It's really encouraging, and I'm more than glad
to work with you. Unfortunately, as you've seen me keep saying already,
I have very limited time to work on this. Thankfully you seem to have
more, and Pekka has also stepped up to help, so maybe we can make good
forward progress despite my limitations.

Regards,

Nigel


Pavel Machek

Apr 26, 2007, 6:50:09 PM

Hi!

> > * Doing things in the right order? (Prepare the image, then do the
> > atomic copy, then save).
>
> I'd actually like to discuss this a bit..
>
> I'm obviously not a huge fan of the whole user/kernel level split and
> interfaces, but I actually do think that there is *one* split that makes
> sense:
>
> - generate the (whole) snapshot image entirely inside the kernel
>
> - do nothing else (ie no IO at all), and just export it as a single image
> to user space (literally just mapping the pages into user space).
> *one* interface. None of the "pretty UI update" crap. Just a single
> system call:
>
> void *snapshot_system(u32 *size);
>
> which will map in the snapshot, return the mapped address and the size
> (and if you want to support snapshots > 4GB, be my guest, but I suspect
> you're actually *better* off just admitting that if you cannot shrink
> the snapshot to less than 32 bits, it's not worth doing)

This is basically how uswsusp is designed. (We do not use a system call;
you just read from /dev/snapshot, and you have to make a few ioctls to
stop the other tasks.)
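
A minimal sketch of the "just read the image from a file descriptor" part, assuming nothing about the real uswsusp tools: copy_image() and CHUNK are hypothetical names, the function works on any readable descriptor, and the ioctls that stop the other tasks are deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

#define CHUNK 4096	/* assumed page size, for illustration only */

/*
 * Copy everything readable from in_fd to out_fd, one chunk at a time.
 * Returns the number of bytes copied, or -errno on failure.
 */
long copy_image(int in_fd, int out_fd)
{
	char buf[CHUNK];
	long total = 0;

	assert(in_fd >= 0 && out_fd >= 0);
	for (;;) {
		ssize_t n = read(in_fd, buf, sizeof(buf));

		if (n == 0)
			return total;		/* end of image */
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -errno;
		}
		/* write() may be partial, so loop until the chunk is out */
		for (ssize_t off = 0; off < n; ) {
			ssize_t w = write(out_fd, buf + off, (size_t)(n - off));

			if (w < 0) {
				if (errno == EINTR)
					continue;
				return -errno;
			}
			off += w;
		}
		total += n;
	}
}
```

In a real tool in_fd would be the snapshot device and out_fd the storage target; here they can be pipes or regular files, which keeps the sketch safe to run.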

> and for testing, you should be able to basically do
>
> u32 size;
> void *buffer = snapshot_system(&size);
> if (buffer != MAP_FAILED)
> resume_snapshot(buffer, size);
>
> and it should obviously work.

Which is what I did a long time ago, during uswsusp development.

> Once you have that snapshot image in user space you can do anything you
> want. And again: you'd have a fully working system: not any degradation
> *at*all*. If you're in X, then X will continue running etc even after the
> snapshotting, although obviously the snapshotting will have tried to page
> a lot of stuff out in order to make the snapshot smaller, so you'll likely
> be crawling.

Well... We decided not to do this in the fully working system. SIGSTOP
is just not strong enough, and we want the snapshot atomic.

Now, it would be _very_ nice to be able to snapshot the system and
continue running, but I just don't see how to do it without extensive
filesystem support.
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Pavel Machek

Apr 26, 2007, 6:50:13 PM

Hi!

> I'd really suggest _just_ the "full image". Nothing else is probably ever
> worth supporting. Your "snapshot to disk" wouldn't be _quite_ as simple as
> "echo disk > /sys/power/state", but it should not necessarily be much
> worse than
>
> snapshot_kernel | gzip -9 > /dev/snapshot

Yep, we "freeze too much", so we can't just use the shell and pipe
it. Too bad.

int write_image(char *resume_dev_name)
{
	static struct swap_map_handle handle;
	struct swsusp_info *header;
	unsigned long start;
	int fd;
	int error;

	fd = open(resume_dev_name, O_RDWR | O_SYNC);
	if (fd < 0) {
		error = -errno;	/* report why open() failed */
		printf("suspend: Could not open resume device\n");
		return error;
	}
	/*
	 * dev and buffer are not declared in this function; presumably
	 * they are file-scope (the /dev/snapshot descriptor and a
	 * page-sized buffer).
	 */
	error = read(dev, buffer, PAGE_SIZE);
	if (error < PAGE_SIZE) {
		error = error < 0 ? -errno : -EFAULT;
		goto out;
	}
	/* the first page read from the snapshot is the image header */
	header = (struct swsusp_info *)buffer;
	if (!enough_swap(header->pages)) {
		printf("suspend: Not enough free swap\n");
		error = -ENOSPC;
		goto out;
	}
	error = init_swap_writer(&handle, fd);
	if (!error) {
		start = handle.cur_swap;
		error = swap_write_page(&handle, header);
	}
	if (!error)
		error = save_image(&handle, header->pages - 1);
	if (!error) {
		flush_swap_writer(&handle);
		printf("S");
		error = mark_swap(fd, start);
		printf("|\n");
	}
	fsync(fd);
out:
	close(fd);	/* do not leak the descriptor on error paths */
	return error;
}

This is basically the loop above, made complex by the fact that we do
not want to have a separate partition for the snapshot; we just want to
reuse free space in the swap partition.

I think you've just invented uswsusp.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

David Lang

Apr 26, 2007, 7:00:10 PM

On Fri, 27 Apr 2007, Pavel Machek wrote:

> This is basically the loop above, made complex by the fact that we do
> not want to have separate partition for snapshot; we just want to
> reuse free space in swap partition.

with the size of drives today is it really that bad to require a separate
partition for this?

I also don't like the idea of storing this in the swap partition for a couple of
reasons.

1. on many modern linux systems the swap partition is not large enough.

for example, on my boxes with 16G of RAM I only allocate 2G of swap space

2. it's too easy for other things to stomp on your swap partition.

for example: booting from a live CD that finds and uses swap partitions

if you are needing space for your freeze, allocate it in an unambiguous way, not
by reusing an existing partition.
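
The first objection is simple arithmetic, sketched below. image_fits() is a hypothetical helper, not the enough_swap() from the uswsusp tools, and it assumes the worst case where the image is as large as RAM (a real image can be smaller once caches are dropped).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if an image of image_pages pages fits into
 * free_swap_bytes of swap. Dividing instead of multiplying keeps the
 * comparison free of overflow.
 */
bool image_fits(uint64_t image_pages, uint64_t page_size,
		uint64_t free_swap_bytes)
{
	assert(page_size > 0);
	return image_pages <= free_swap_bytes / page_size;
}
```

With 4 KiB pages, a 16G image does not fit into 2G of swap, while a 1G image does.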

David Lang

Linus Torvalds

Apr 26, 2007, 7:20:06 PM

On Fri, 27 Apr 2007, Rafael J. Wysocki wrote:
>
> Well, I think that much of what Linus is saying indicates that he hasn't tried
> to write any such thing himself. ;-)

That's definitely true. The only interaction I ever had with "hibernation"
(and yes, we should just call it that) is when I was working on s2ram and
cleaning up the PCI device suspend/resume in particular, and trying
(_mostly_ successfully - I think I broke it once or twice mainly due to
interactions with the console, but on the whole I think it mostly worked)
to not break hibernation in the process without actually running it.

> Now that I know much more than before, I can say I agree with Linus on his
> opinion about the separation of s2ram from the snapshot/restore functionality
> (I'll call it 'hibernation' for simplicity from now on).

So my strong opinion on it literally comes from the other end (ie _not_
knowing about hibernation, but trying to work with s2ram, and cursing the
mixups).

> It should be done, because it would make things simpler and cleaner.
> Still, it will be difficult to do without screwing users en masse and
> that's my main concern here.

I do agree. It will inevitably affect a lot of devices. That's always
painful.

> I don't agree that we don't need the tasks freezer for suspending and
> hibernation. We need it, because we need to be sure that the (other) tasks
> will not get us in the way, and that also applies to kernel threads (and I
> don't think the tasks freezer is 'screwing' them, BTW).

I actually feel much less strongly about that, because just separating out
s2ram and hibernate entirely from each other would already really get the
thing _I_ care about taken care of - being able to work on one of
them without fear of breaking the other one.

And besides, I actually came into the whole discussion because I'm not a
huge fan of thinking that user-land is "better". If the thing can sanely
be done in kernel, I'm actually all for that. What drives me wild is
having three different things, and nobody driving.

It needs somebody who (a) cares (b) has good taste and (c) has enough time
and personal karma to burn that he can actually take the (obviously)
inevitable heat from just doing things right, and convincing people to
select *one* implementation.

That kind of person is really really hard to find. And if you're it,
you're in for some pain ;)

Linus

Pavel Machek

Apr 26, 2007, 7:20:09 PM

Hi!

> >This is basically the loop above, made complex by the fact that we do
> >not want to have separate partition for snapshot; we just want to
> >reuse free space in swap partition.
>
> with the size of drives today is it really that bad to require a separate
> partition for this?

Yes. You want uswsusp to work in situations where swsusp worked.

> I also don't like the idea of storing this in the swap partition for a
> couple of reasons.
>
> 1. on many modern linux systems the swap partition is not large enough.
>
> for example, on my boxes with 16G of RAM I only allocate 2G of swap
> space

WTF? So allocate a larger swap partition. You just told me disks are big
enough.

> 2. it's too easy for other things to stomp on your swap partition.
>
> for example: booting from a live CD that finds and uses swap
> partitions

That's a feature. If you are booting from live CD, you _want_ to erase
any hibernation image.

> if you are needing space for your freeze, allocate it in an unambiguous way,
> not by reusing an existing partition.

Of course you have that option. Writing the image is done in userspace, so
you are free to write it to a raw partition (and the first versions indeed
did that).

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Pavel Machek

Apr 26, 2007, 7:30:16 PM

Hi!

> >That's a feature. If you are booting from live CD, you _want_ to erase
> >any hibernation image.
>

> why?
>
> it's been stated that doing a std and booting another OS (including
> windows) is a valid and common usage. saying that if you boot another OS
> you trash your suspended image doesn't sound reasonable.

If you hibernate your machine, boot from a live CD, and change anything
on any filesystem, you are pretty likely to lose that filesystem.

Doing that with Windows is okay as Windows does not usually write to
ext3 partitions.

David Lang

Apr 26, 2007, 7:30:19 PM

On Fri, 27 Apr 2007, Pavel Machek wrote:

> Hi!
>
>>> This is basically the loop above, made complex by the fact that we do
>>> not want to have separate partition for snapshot; we just want to
>>> reuse free space in swap partition.
>>
>> with the size of drives today is it really that bad to require a separate
>> partition for this?
>
> Yes. You want uswsusp to work in situations where swsusp worked.
>
>> I also don't like the idea of storing this in the swap partition for a
>> couple of reasons.
>>
>> 1. on many modern linux systems the swap partition is not large enough.
>>
>> for example, on my boxes with 16G of RAM I only allocate 2G of swap
>> space
>
> WTF? So allocate larger swap partition. You just told me disks are big
> enough.

swap partitions are limited to 2G (or at least they were a couple of months ago
when I last checked). I also don't want to run the risk of having a box try to
_use_ 16G worth of swap. I'd rather have the box hit OOM first.

>> 2. it's too easy for other things to stomp on your swap partition.
>>
>> for example: booting from a live CD that finds and uses swap
>> partitions
>
> That's a feature. If you are booting from live CD, you _want_ to erase
> any hibernation image.

why?

it's been stated that doing a std and booting another OS (including windows) is
a valid and common usage. saying that if you boot another OS you trash your
suspended image doesn't sound reasonable.

David Lang

Apr 26, 2007, 7:40:08 PM

On Fri, 27 Apr 2007, Pavel Machek wrote:

> Hi!
>
>>> That's a feature. If you are booting from live CD, you _want_ to erase
>>> any hibernation image.
>>
>> why?
>>
>> it's been stated that doing a std and booting another OS (including
>> windows) is a valid and common usage. saying that if you boot another OS
>> you trash your suspended image doesn't sound reasonable.
>
> If you hibernate your machine, boot from live cd, and change anything
> on any filesystem, you are pretty likely to lose that filesystem.

booting from a live CD doesn't mean that you are going to mount the filesystem,
let alone change it. but swap is not supposed to be this sensitive.

David Lang

> Doing that with Windows is okay as Windows does not usually write to
> ext3 partitions.
> Pavel
>

Olivier Galibert

Apr 26, 2007, 8:20:09 PM

On Fri, Apr 27, 2007 at 06:50:56AM +1000, Nigel Cunningham wrote:
> I'm perfectly willing to think through some alternate approach if you
> suggest something or prod my thinking in a new direction, but I'm afraid
> I just can't see right now how we can achieve what you're after.

Ok, what about this approach I've been mulling over for a while:

Suspend-to-disk is pretty much an exercise in state saving. There are
multiple ways to do state saving, but they tend to end up in two
categories: implicit and explicit.

In implicit state saving, you try to save the state of the
system/application/whatever "under its feet", more or less, and then
fix up what is not saved/saveable correctly. A well-known example is
the undumping process Emacs goes (went?) through, where it tries to dump
the state of the memory as a new executable, with a lot of pleasure with
various executable formats and subtleties due to side effects in libc
code you don't control.

In explicit state saving each object saves what is needed from its
state to an independently defined format (instead of "whatever the
memory organization happens to be at that point"). When reloading the
state you have to parse it, and it usually requires
rebuilding/relocating all references/pointers/etc. XEmacs currently
has a "portable dumper" that pretty much does just that. We don't
have any redumping problems anymore, they're over.

Which one is the best depends heavily on the application. The amount
of code in the implicit case depends on the amount of fixups to do.
In the kernel case it happens to be a lot, pretty much everything that
touches hardware has to save to memory the device state and reload it
on resume. And bugs in hardware handling can be quite annoying to
debug. And if some driver does not do save/resume correctly, you
have no way, outside of playing with modules, to ensure the safety of
the suspend cycle.

The amount of code in the explicit case is an interesting variable in
the case of the kernel. You have to save what is needed, but how do
you define what is needed? It is, pretty much, what running processes
can observe from userspace. Now, what can a process observe:
- its application text and anonymous memory pages
- its file handles
- its mapped files
- its mapped whatever else
- its sys5 IPC stuff
- futex stuff and friends, namespaces, etc
- its intrinsic characteristics it can reach through syscalls
(i.e. the user-visible parts of current, like pid, uid...)
- its currently running system call, if any

So that's what we'd have to explicitly save. Anonymous memory, sys5
IPC, futex and current structures, that's easy stuff in practice. The
fun parts are pretty much:
- references to files
- references to active networking links
- references to devices and associated visible state
- currently running system call, aka the kernel stack for the process

The last one is the one I'm the most afraid of. I hope that the
signal stuff and/or the asynchronous syscall stuff that was discussed
recently would allow us to "unwind" blocking system calls back to the
syscall level and then store the parameters for resume-time restart.
The non-blocking calls you can just let finish.

The first one is really interesting. If you value your filesystems,
you'd rather have them clean after the suspend. And also you pretty
much know that filesystems can move around when you're not looking, be
it USB hotplug stuff (discovery order is random-ish isn't it?), module
loading order issues or multithreaded device discovery. So you're way
more happy *not* caching anything from the filesystem you can avoid.

But what is a file reference, really? With the dcache handy, it's
pretty much a path, since inodes don't always exist reliably. And if
you have the lists of paths used by the processes on a particular
filesystem, you can easily get an idea of where, if anywhere, the
filesystem is even if you don't have reliable serials. More
interestingly, you cannot, in any case, instantly corrupt your
filesystem by having a mismatch between the in-memory cache and the
reality.

The processes which referenced files you can't find anywhere will
end-up with EBADF or segfault depending on whether it was fd or mmap,
ala revoke(). They'll probably die horribly. I'd rather have
processes die than filesystems die, since in any case if the file
isn't here anymore in practice the process could only destroy things.

An interesting thing there: nothing in that touches either the
filesystem or the block devices. Everything is done at the VFS level.
The devices don't need to care. And the "this filesystem goes there"
can be done in userspace in an initramfs if people want to experiment
with kinky strategies. After all, why not allow a sysadmin to regroup
two filesystems into one through a suspend, the processes mostly don't
need to care (well, tar may, but heh). Deleted files would have to be
sillyrenamed or something. Implementation details ;-)

Active networking links, you can consider them dead for a start. The
networking guys can play with keepalives and stuff if they want to in
a second step. Network seldom survives suspend anyway, too many
timeouts involved, especially with dynamic IPs.

That leaves references to devices. null, ptys, random, log are not a
problem, they're virtual constructs. In a first approximation you can
revoke() the rest brutally. On a "standard" system that will kill X
(ouch), GPM and other input-interested devices, and everything with an
opened sound device. Then you can add explicit state saving support
to the devices you want, one by one. It may be possible to handle
sound collectively at the ALSA layer level, I don't really know.
Input shouldn't be too hard, not much state to save, X will be a pain
and will probably need special casing. X is a big special case
anyway, no matter what happens.

For the less directly used devices you can always add explicit support
when you feel like it. The interesting part is that either the device
supports the suspend and says so explicitly, or the process can't
access the device anymore using the previous fds/mmaps after resume.
No weird half-condition. If (very) resilient, the process can even
close, reopen, reconfigure and go on its merry way.

And if you design the saving format correctly (attribute name/value
pairs as text work beautifully for such a case), you can be resilient
to extreme things including kernel version change or rsync-ing / and
the state file and resuming in another box. And if a device gets
something it can't parse as the state to go back to for a given
fd/mmap for a process, it can always revoke() that one and go on.
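
A small sketch of such a text format, under stated assumptions: the attribute names baud and echo, struct dev_state and restore_state() are all invented for illustration. Unknown attributes are skipped (so a newer kernel's extra fields do not break an older parser), and a value that cannot be parsed marks the reference as revoked.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-fd device state restored from "name=value" lines. */
struct dev_state {
	long baud;
	long echo;
	int revoked;	/* set when the saved state could not be parsed */
};

/* Parse a decimal integer spanning exactly [val, end); 0 on success. */
static int parse_long(const char *val, const char *end, long *out)
{
	char *stop;
	long v;

	if (val == end)
		return -1;
	v = strtol(val, &stop, 10);
	if (stop != end)
		return -1;
	*out = v;
	return 0;
}

void restore_state(const char *text, struct dev_state *st)
{
	assert(text != NULL && st != NULL);
	st->baud = 0;
	st->echo = 0;
	st->revoked = 0;

	while (*text) {
		const char *eol = strchr(text, '\n');
		const char *eq;
		int err = 0;

		if (!eol)
			eol = text + strlen(text);
		if (eol != text) {	/* skip empty lines */
			eq = memchr(text, '=', (size_t)(eol - text));
			if (!eq) {
				err = -1;	/* not a name=value pair */
			} else {
				size_t klen = (size_t)(eq - text);

				if (klen == 4 && !memcmp(text, "baud", 4))
					err = parse_long(eq + 1, eol, &st->baud);
				else if (klen == 4 && !memcmp(text, "echo", 4))
					err = parse_long(eq + 1, eol, &st->echo);
				/* unknown attributes are ignored on purpose */
			}
			if (err)
				st->revoked = 1;
		}
		text = *eol ? eol + 1 : eol;
	}
}
```

Ignoring unknown names and revoking on bad values gives the "no weird half-condition" property: the state is either restored or the reference is cut off.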

The main point of that kind of state-saving is to be
trustable-by-design. For each process, either its environment could
be restored correctly or the incorrect parts can not be accessed
anymore. And the stability of the system and its filesystems is
ensured pretty much whatever happens.

There are a billion details to take into account in a real
implementation, but I'm sure you can get the gist of the idea.

OG.

Olivier Galibert

Apr 26, 2007, 8:30:10 PM

On Thu, Apr 26, 2007 at 03:49:51PM -0700, David Lang wrote:
> swap partitions are limited to 2G (or at least they were a couple of months
> ago when I last checked). I also don't want to run the risk of having a box
> try to _use_ 16G worth of swap. I'd rather have the box hit OOM first.

They aren't limited anymore, I have a number of machines with 20G swap
for experiments.

OG.

Pekka J Enberg

Apr 27, 2007, 1:00:13 AM

On Thu, 2007-04-26 at 09:56 -0700, Linus Torvalds wrote:
> > which will map in the snapshot, return the mapped address and the size
> > (and if you want to support snapshots > 4GB, be my guest, but I suspect
> > you're actually *better* off just admitting that if you cannot shrink
> > the snapshot to less than 32 bits, it's not worth doing)

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> That inherently limits the image to half of available ram (you need
> somewhere to store the snapshot), so you won't get the full image you
> express interest in below.

It doesn't. We can make the userspace mapped pages copy-on-write. As long
as the userspace makes sure there's not much activity during
snapshot/shutdown, we will be fine. What we probably do need to copy is
kernel pages.
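
Copy-on-write snapshots can be observed from userspace with fork(), which is used here purely as an analogy and not as the kernel mechanism being proposed: the child starts with a COW view of the parent's memory, so a later write by the parent does not change what the child sees. cow_snapshot_demo() is a hypothetical helper name.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a "snapshot" child, then overwrite *live in the parent.
 * Returns the value the child still sees afterwards, or -1 on error.
 */
int cow_snapshot_demo(int *live, int new_value)
{
	int to_child[2], to_parent[2];
	int seen = -1;
	char token = 'x';
	pid_t pid;

	assert(live != NULL);
	if (pipe(to_child) < 0)
		return -1;
	if (pipe(to_parent) < 0) {
		close(to_child[0]);
		close(to_child[1]);
		return -1;
	}

	pid = fork();	/* the child's pages are now COW copies of ours */
	if (pid < 0) {
		close(to_child[0]);
		close(to_child[1]);
		close(to_parent[0]);
		close(to_parent[1]);
		return -1;
	}
	if (pid == 0) {
		/* child: wait until the parent has written, then report */
		close(to_child[1]);
		close(to_parent[0]);
		if (read(to_child[0], &token, 1) == 1) {
			if (write(to_parent[1], live, sizeof(*live)) < 0)
				_exit(1);
		}
		_exit(0);
	}

	close(to_child[0]);
	close(to_parent[1]);
	*live = new_value;	/* this write gets a private copy of the page */
	if (write(to_child[1], &token, 1) == 1) {
		if (read(to_parent[0], &seen, sizeof(seen)) != (ssize_t)sizeof(seen))
			seen = -1;
	}
	close(to_child[1]);
	close(to_parent[0]);
	waitpid(pid, NULL, 0);
	return seen;
}
```

Only the pages written after the fork are duplicated, which is why a COW snapshot does not need a second full copy of memory up front.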

Pekka

Pekka Enberg

Apr 27, 2007, 1:50:27 AM

On 4/27/07, Pavel Machek <pa...@ucw.cz> wrote:
> Now, it would be _very_ nice to be able to snapshot system and
> continue running, but I just don't see how to do it without extensive
> filesystem support.

So what kind of support do we need from the filesystem?

Pekka

Nigel Cunningham

Apr 27, 2007, 2:10:13 AM

Hi.

On Fri, 2007-04-27 at 07:52 +0300, Pekka J Enberg wrote:
> On Thu, 2007-04-26 at 09:56 -0700, Linus Torvalds wrote:
> > > which will map in the snapshot, return the mapped address and the size
> > > (and if you want to support snapshots > 4GB, be my guest, but I suspect
> > > you're actually *better* off just admitting that if you cannot shrink
> > > the snapshot to less than 32 bits, it's not worth doing)
>
> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > That inherently limits the image to half of available ram (you need
> > somewhere to store the snapshot), so you won't get the full image you
> > express interest in below.
>
> It doesn't. We can make the userspace mapped pages copy-on-write. As long
> as the userspace makes sure there's not much activity during
> snapshot/shutdown, we will be fine. What we probably do need to copy is
> kernel pages.

COW is a possibility, but I understood (perhaps wrongly) that Linus was
thinking of a single syscall or such like to prepare the snapshot. If
you're going to start doing things like this, won't that mean you'd then
have to update/redo the snapshot or somehow nullify the effect of
anything the program does so that doing it again after the snapshot is
restored doesn't cause problems?

I was going to leave it at that and press send, but perhaps that
wouldn't be wise. I feel I should also ask what you're thinking of as a
means of making sure userspace doesn't do much activity.

Thanks for your labours!

Regards,

Nigel


Pekka J Enberg

Apr 27, 2007, 2:20:07 AM

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> COW is a possibility, but I understood (perhaps wrongly) that Linus was
> thinking of a single syscall or such like to prepare the snapshot. If
> you're going to start doing things like this, won't that mean you'd then
> have to update/redo the snapshot or somehow nullify the effect of
> anything the programs does so that doing it again after the snapshot is
> restored doesn't cause problems?

No. The snapshot is just that. A snapshot in time. From the kernel's point
of view, it doesn't matter one bit when you did it or whether the state has
changed before you resume. It's up to userspace to make sure the user
doesn't do real work while the snapshot is being written to disk and
the machine is shut down.

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> I was going to leave it at that and press send, but perhaps that
> wouldn't be wise. I feel I should also ask what you're thinking of as a
> means of making sure userspace doesn't do much activity.

When the snapshot pages are COW, we will run out of memory if userspace
writes to those pages too much. If userspace is blocked, say by
displaying a "we are suspending" dialog in X that keeps the user from using
other programs that could generate new writes, and by mounting filesystems
read-only, we don't need to worry about running out of memory.
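
The memory argument can be modelled in a few lines. This is a toy model with hypothetical names (struct cow_model, cow_write(), NPAGES), not kernel code: the first write to each snapshotted page consumes one spare page, later writes to the same page are free, and once the spare pool is empty the next first write fails.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8	/* size of the toy address space */

struct cow_model {
	bool copied[NPAGES];	/* page already duplicated by a write */
	size_t spare;		/* free pages left for new copies */
};

/* Model a write to one snapshotted page; 0 on success, -ENOMEM if full. */
int cow_write(struct cow_model *m, size_t page)
{
	assert(m != NULL && page < NPAGES);
	if (m->copied[page])
		return 0;	/* copy already exists, no new memory needed */
	if (m->spare == 0)
		return -ENOMEM;	/* nothing left to hold another copy */
	m->spare--;
	m->copied[page] = true;
	return 0;
}
```

The cost is bounded by the number of distinct pages touched, which is why keeping userspace quiet while the image is written matters more than how long the write takes.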

Pekka J Enberg

Apr 27, 2007, 2:30:12 AM

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > COW is a possibility, but I understood (perhaps wrongly) that Linus was
> > thinking of a single syscall or such like to prepare the snapshot. If
> > you're going to start doing things like this, won't that mean you'd then
> > have to update/redo the snapshot or somehow nullify the effect of
> > anything the programs does so that doing it again after the snapshot is
> > restored doesn't cause problems?

On Fri, 27 Apr 2007, Pekka J Enberg wrote:
> No. The snapshot is just that. A snapshot in time. From kernel point of
> view, it doesn't matter one bit when you did it or if the state has
> changed before you resume. It's up to userspace to make sure the user
> doesn't do real work while the snapshot is being written to disk and
> machine is shut down.

Btw, obviously we need to break the COW when resuming and not include the
snapshot mapping. However, that should be trivially doable by snapshotting
the page mappings before remapping them as COW.

Nigel Cunningham

Apr 27, 2007, 2:40:07 AM

Hi.

On Fri, 2007-04-27 at 09:18 +0300, Pekka J Enberg wrote:
> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > COW is a possibility, but I understood (perhaps wrongly) that Linus was
> > thinking of a single syscall or such like to prepare the snapshot. If
> > you're going to start doing things like this, won't that mean you'd then
> > have to update/redo the snapshot or somehow nullify the effect of
> > anything the programs does so that doing it again after the snapshot is
> > restored doesn't cause problems?
>
> No. The snapshot is just that. A snapshot in time. From kernel point of
> view, it doesn't matter one bit when you did it or if the state has
> changed before you resume. It's up to userspace to make sure the user
> doesn't do real work while the snapshot is being written to disk and
> machine is shut down.

Sorry Pekka, but that's just broken.

It implies firstly that we tell all userspace programs "I'm sorry, but
I'm suspending at the moment. Can you tip toe quietly around while I do
it?" You can't seriously expect every userspace program to be modified
to adjust its behaviour according to whether we're writing a snapshot
to disk at the moment or not.

It also implies that we can prepare a snapshot and then happily have the
contents of the disk change so that they don't match the superblock and
other filesystem details we just saved in the snapshot. We can't. At
least not without modifying all the filesystems so that (at a minimum)
they know how to throw away all the metadata they have at resume time
and reread it from disk.

> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > I was going to leave it at that and press send, but perhaps that
> > wouldn't be wise. I feel I should also ask what you're thinking of as a
> > means of making sure userspace doesn't do much activity.
>
> When the snapshot pages are COW, we will run out of memory if userspace
> writes to those pages too much. If userspace is blocked, say like
> displaying a "we are suspending" in X which blocks the user from using
> other programs that could generate new writes and mounting filesystems
> read-only, we don't need to worry about running out of memory.

This sounds feasible, but it's only really acceptable if you're willing to
have hibernation fail or restart multiple times. If your battery is
running out or you need to rush to put a lappy in your bag because the
train just came early, that's not an option. It's for that very reason
that I've put a lot of effort into trying to make it work first time,
every time. Not there yet, but it's a priority.

By the way, sorry. This email feels like it is pouring a lot of cold
water on your ideas. I don't want to be negative!

Regards,

Nigel


Pekka J Enberg

Apr 27, 2007, 3:00:12 AM

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> Sorry Pekka, but that's just broken.

It certainly isn't.

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> It implies firstly that we tell all userspace programs "I'm sorry, but
> I'm suspending at the moment. Can you tip toe quietly around while I do
> it?" You can't seriously expect every userspace program to be modified
> to adjust its behaviour according to whether we're writing a snapshot
> to disk at the moment or not.

You don't need to modify other programs. You just need to display the
progress bar and block _user input_. I don't even claim to know X, but I
would be extremely surprised if you technically can't say "don't let
the user touch any other windows except this one." The user couldn't care
less whether tasks are frozen or not by the kernel. What matters is that
the user can't shoot himself in the foot while snapshotting.

Furthermore, we probably do need to do other things to ensure safety, like
remounting filesystems read-only but again, this has nothing to do with
snapshotting per se. What the kernel needs to worry about is (1) providing
an atomic snapshot that is consistent and (2) resuming to that snapshot
safely. If the _user_ loses data that was generated between snapshot +
shutdown, it's absolutely no concern for the snapshot operation!

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> It also implies that we can prepare a snapshot and then happily have the
> contents of the disk change so that they don't match the superblock and
> other filesystem details we just saved in the snapshot. We can't. At
> least not without modifying all the filesystems so that (at a minimum)
> they know how to throw away all the metadata they have at resume time
> and reread it from disk.

But you just explained how we can! We shouldn't bend over backwards for
snapshotting just because the filesystems don't currently support
something we need.

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> By the way, sorry. This email feels like it is pouring a lot of cold
> water on your ideas. I don't want to be negative!

Don't worry, I am used to cold water :-).

Nigel Cunningham

Apr 27, 2007, 3:10:10 AM

Hi.

On Fri, 2007-04-27 at 09:50 +0300, Pekka J Enberg wrote:
> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > Sorry Pekka, but that's just broken.
>
> It certainly isn't.
>
> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > It implies firstly that we tell all userspace programs "I'm sorry, but
> > I'm suspending at the moment. Can you tip toe quietly around while I do
> > it?" You can't seriously expect every userspace program to be modified
> > to adjust its behaviour according to whether we're writing a snapshot
> > to disk at the moment or not.
>
> You don't need to modify other programs. You just need to display the
> progress bar and block _user input_. I don't even claim to know X, but I
> would be extremely surprised if you technically can't say "don't let
> the user touch any other windows except this one." The user couldn't care
> less whether tasks are frozen or not by the kernel. What matters is that
> the user can't shoot himself in the foot while snapshotting.

User input doesn't account for all system activity. Think of cron jobs
or user initiated jobs that may have started before the cycle began.

> Furthermore, we probably do need to do other things to ensure safety, like
> remounting filesystems read-only but again, this has nothing to do with
> snapshotting per se. What the kernel needs to worry about is (1) providing
> an atomic snapshot that is consistent and (2) resuming to that snapshot
> safely. If the _user_ loses data that was generated between snapshot +
> shutdown, it's absolutely no concern for the snapshot operation!

Noooo! If the user loses data, the user will be concerned and we should
be. I for one would do my best to avoid using software that loses my
data for me. I wouldn't care if you said "Well, it's your fault. You
lost the data." From my perspective as a user, I didn't lose the data,
some part of the computer's OS did.

> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > It also implies that we can prepare a snapshot and then happily have the
> > contents of the disk change so that they don't match the superblock and
> > other filesystem details we just saved in the snapshot. We can't. At
> > least not without modifying all the filesystems so that (at a minimum)
> > they know how to throw away all the metadata they have at resume time
> > and reread it from disk.
>
> But you just explained how we can! We shouldn't bend over backwards for
> snapshotting just because the filesystems don't currently support
> something we need.

Sorry, but I just don't believe filesystems should need to throw away
metadata post resume. If we let data be changed after snapshotting (or
ourselves cause it to be changed), we're the ones that are broken. Our
snapshot is out of date and the expectations of userspace programs that
were snapshotted will be out of date. Just imagine, for example, a
userspace program that is snapshotted, then reads and deletes a
temporary file. After the snapshot restore, it's running again. But
wait, we can't read or delete the file again because it's already gone.
Life just gets more complicated and confusing this way.

> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > By the way, sorry. This email feels like it is pouring a lot of cold
> > water on your ideas. I don't want to be negative!
>
> Don't worry, I am used to cold water :-).

Maybe, but I'd still rather be encouraging!

Nigel


Pekka J Enberg

Apr 27, 2007, 3:30:09 AM

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> User input doesn't account for all system activity. Think of cron jobs
> or user initiated jobs that may have started before the cycle began.

Yes, but the _user_ did not start them so they didn't lose any work. See,
it might or might not be important, but that's something _userspace_
knows much more about than the kernel ever will.

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> Noooo! If the user loses data, the user will be concerned and we should
> be. I for one would do my best to avoid using software that loses my
> data for me. I wouldn't care if you said "Well, it's your fault. You
> lost the data." From my perspective as a user, I didn't lose the data,
> some part of the computer's OS did.

You are looking at snapshot/shutdown from kernel and user experience point
of view at the same time which causes confusion here.

Let me repeat: it is _absolutely no concern_ of the _kernel_ whether you
resume to a snapshot that does not contain all your precious data. The
kernel doesn't care one bit!

That being said, the _userspace solution_ obviously needs to take this
into account by blocking user input, making filesystems read-only, and
maybe even blocking certain background processes (cron and beagle indexing
come to mind).

It doesn't. We can either make the filesystem read-only or, surprise,
surprise, make a _snapshot_ of the filesystem!

And while the points you raised are important for the full
end-user solution, it is absolutely not interesting to snapshot_system().
The only thing it needs to guarantee is a consistent snapshot that we can
resume later.

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> Maybe, but I'd still rather be encouraging!

You are. Perhaps you just don't know it yet. ;-)

Pekka Enberg

Apr 27, 2007, 4:00:22 AM

On 4/26/07, Linus Torvalds <torv...@linux-foundation.org> wrote:
> In fact, I personally feel that I shouldn't even have merged
> userspace-swsusp, but if Andrew thinks it needs to be merged, my personal
> feelings simply don't matter that much. I have to trust people. But yes,
> as far as *I* am personally concerned, I think it was a mistake to merge
> it.

While the ioctl() interface is horrid, I think it's actually in
principle pretty close to your snapshot_system()/resume_snapshot().
The ugliness probably comes from the fact that suspend to RAM and
snapshot/shutdown are interleaved there too.

Oliver Neukum

Apr 27, 2007, 6:10:16 AM

On Friday, 27 April 2007 08:18, Pekka J Enberg wrote:
> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > COW is a possibility, but I understood (perhaps wrongly) that Linus was
> > thinking of a single syscall or such like to prepare the snapshot. If
> > you're going to start doing things like this, won't that mean you'd then
> > have to update/redo the snapshot or somehow nullify the effect of
> > anything the program does so that doing it again after the snapshot is
> > restored doesn't cause problems?
>
> No. The snapshot is just that. A snapshot in time. From kernel point of
> view, it doesn't matter one bit when you did it or if the state has
> changed before you resume. It's up to userspace to make sure the user
> doesn't do real work while the snapshot is being written to disk and
> machine is shut down.

And where is the benefit in that? How is such user space freezing logic
simpler than having the kernel do the write?
What can you do in user space if all filesystems are r/o that is worth the
hassle?

Regards
Oliver

Christoph Hellwig

Apr 27, 2007, 7:33:00 AM

On Thu, Apr 26, 2007 at 05:38:07PM -0400, Theodore Tso wrote:
> On Fri, Apr 27, 2007 at 06:08:01AM +1000, Nigel Cunningham wrote:
> > We tried that. It would need some work. IIRC remounting filesystems
> > read-only makes files become marked read-only. Perfectly sensible,
> > except that if you then remount the filesystem rw at resume time, all
> > those files are still marked ro and userspace crashes and burns. Not
> > unfixable, I'll agree, but there is more work to do there.
>
> There are other solutions, though. One is that we could export a
> system call interface which freezes a filesystem and prevents any
> further I/O. We mostly have something like that right now (via the
> write_super_lockfs function in the superblock operations
> structure), but we haven't exported it to userspace.

It is exported on XFS ;-)

> We would also need a similar interface to freeze any block device I/O,
> in case you have a database running and doing direct I/O to a block
> device. (Or again, we could simply not support that case; how many
> people will be running a database accessing a block device on
> their laptop?)

block device I/O uses generic_file*whateveriscurrenthere*_write, which
checks for the freeze flag, so the infrastructure for that is there
as well.

Daniel Pittman

Apr 27, 2007, 7:33:16 AM

Olivier Galibert <gali...@pobox.com> writes:
> On Fri, Apr 27, 2007 at 06:50:56AM +1000, Nigel Cunningham wrote:
>
>> I'm perfectly willing to think through some alternate approach if you
>> suggest something or prod my thinking in a new direction, but I'm
>> afraid I just can't see right now how we can achieve what you're
>> after.
>
> Ok, what about this approach I've been mulling about for a while:
>
> Suspend-to-disk is pretty much an exercise in state saving. There are
> multiple ways to do state saving, but they tend to end up in two
> categories: implicit and explicit.

[...]

> In explicit state saving each object saves what is needed from its
> state to an independently defined format (instead of "whatever the
> memory organization happens to be at that point"). When reloading the
> state you have to parse it, and it usually requires
> rebuilding/relocating all references/pointers/etc.

If you are looking seriously at this you might want to start with the
code in the OpenVZ kernel (http://openvz.org) that allows a VE to
"checkpoint" to disk and "restore" on the same or a different machine.

This is, as far as I can tell, a portable implementation of this that
already handles real live userspace applications moving transparently
between two machines.

It has the advantage that it lives in an orderly world where most
devices and the file system are virtual but, hey, it works right now.

Regards,
Daniel
--
Digital Infrastructure Solutions -- making IT simple, stable and secure
Phone: 0401 155 707 email: con...@digital-infrastructure.com.au
http://digital-infrastructure.com.au/

Pekka J Enberg

Apr 27, 2007, 7:34:36 AM

On Friday, 27 April 2007 08:18, Pekka J Enberg wrote:
> > No. The snapshot is just that. A snapshot in time. From kernel point of
> > view, it doesn't matter one bit when you did it or if the state has
> > changed before you resume. It's up to userspace to make sure the user
> > doesn't do real work while the snapshot is being written to disk and
> > machine is shut down.

On Fri, 27 Apr 2007, Oliver Neukum wrote:
> And where is the benefit in that? How is such user space freezing logic
> simpler than having the kernel do the write?
>
> What can you do in user space if all filesystems are r/o that is worth the
> hassle?

I am talking about snapshot_system() here. It's not given that the
filesystems need to be read-only (you can snapshot them too). The benefit
here is that you can do whatever you want with the snapshot (encrypt,
compress, send over the network) and have a clean well-defined interface
in the kernel. In addition, aborting the snapshot is simpler, simply
munmap() the snapshot.

The problem with writing in the kernel is obvious: we need to add new code
to the kernel for compression, encryption, and userspace interaction
(graphical progress bar) that are important for user experience.

Pekka
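
A rough userspace sketch of the flow described above may help: map the snapshot, hand it to ordinary userspace code, and abort or clean up with a plain munmap(). Everything here is an assumption for illustration. snapshot_system() was only a proposal in this thread, so fake_snapshot_system() is a hypothetical stand-in that maps anonymous memory, and write_image() and save_snapshot() are made-up helper names.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* HYPOTHETICAL stand-in for the proposed snapshot_system() call.
 * No such syscall exists; this stub maps anonymous memory and fills it
 * with a pattern so that the surrounding logic can be exercised. */
static void *fake_snapshot_system(uint32_t *size)
{
    const uint32_t len = 4 * 4096;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    memset(p, 0xA5, len);
    *size = len;
    return p;
}

/* Write the whole image to fd, coping with short writes.
 * Compression or encryption would slot in here, entirely in userspace.
 * Returns 0 on success, -1 on error. */
static int write_image(int fd, const void *image, uint32_t size)
{
    const unsigned char *cur = image;
    uint32_t left = size;

    while (left > 0) {
        ssize_t n = write(fd, cur, left);
        if (n <= 0) {
            perror("write");
            return -1;
        }
        cur += n;
        left -= (uint32_t)n;
    }
    return 0;
}

/* Take a snapshot and save it to fd.
 * Returns the number of bytes saved, or -1 on failure. */
static long save_snapshot(int fd)
{
    uint32_t size = 0;
    void *image = fake_snapshot_system(&size);

    if (image == NULL)
        return -1;
    assert(size > 0);

    int rc = write_image(fd, image, size);

    /* Aborting and normal cleanup are the same operation: unmap. */
    munmap(image, size);
    return rc == 0 ? (long)size : -1;
}
```

Calling save_snapshot() on an open file descriptor writes the fake image and returns its size; on a bad descriptor it unmaps the image and returns -1, which is the "abort is just munmap()" property in miniature.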

Pavel Machek

Apr 27, 2007, 9:00:14 AM

Hi!

> > * Doing things in the right order? (Prepare the image, then do the
> > atomic copy, then save).
>
> I'd actually like to discuss this a bit..
>
> I'm obviously not a huge fan of the whole user/kernel level split and
> interfaces, but I actually do think that there is *one* split that makes
> sense:
>
> - generate the (whole) snapshot image entirely inside the kernel
>
> - do nothing else (ie no IO at all), and just export it as a single image
> to user space (literally just mapping the pages into user space).
> *one* interface. None of the "pretty UI update" crap. Just a single
> system call:
>
> void *snapshot_system(u32 *size);
>
> which will map in the snapshot, return the mapped address and the size
> (and if you want to support snapshots > 4GB, be my guest, but I suspect
> you're actually *better* off just admitting that if you cannot shrink
> the snapshot to less than 32 bits, it's not worth doing)

I think this is very similar to current uswsusp design; except that we
are using read on /dev/snapshot to read the snapshot (not memory
mapping) and that we freeze the system (because I do not think killall
_SIGSTOP is enough).

Can you confirm that it is indeed similar design, or tell me why I'm
wrong? You had some pretty strong words for uswsusp before, so I'd
like to understand your position here. ("Ouch, I do not know, I am out
of time" is still better reply than silence.)

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Pavel Machek

Apr 27, 2007, 11:00:18 AM

On Fri 2007-04-27 08:41:56, Pekka Enberg wrote:
> On 4/27/07, Pavel Machek <pa...@ucw.cz> wrote:
> >Now, it would be _very_ nice to be able to snapshot system and
> >continue running, but I just don't see how to do it without extensive
> >filesystem support.
>
> So what kind of support do we need from the filesystem?

"forcedremount ro, not telling anyone, not killing processes" would do
the trick. FS snapshots might do.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Oliver Neukum

Apr 27, 2007, 3:20:06 PM

On Friday, 27 April 2007 12:12, Pekka J Enberg wrote:
> I am talking about snapshot_system() here. It's not given that the
> filesystems need to be read-only (you can snapshot them too). The benefit
> here is that you can do whatever you want with the snapshot (encrypt,
> compress, send over the network) and have a clean well-defined interface
> in the kernel. In addition, aborting the snapshot is simpler, simply
> munmap() the snapshot.

But is that worth the trade off?

> The problem with writing in the kernel is obvious: we need to add new code
> to the kernel for compression, encryption, and userspace interaction
> (graphical progress bar) that are important for user experience.

The kernel can already do compression and encryption.

Regards
Oliver

Rafael J. Wysocki

Apr 27, 2007, 5:30:15 PM

On Friday, 27 April 2007 06:52, Pekka J Enberg wrote:
> On Thu, 2007-04-26 at 09:56 -0700, Linus Torvalds wrote:
> > > which will map in the snapshot, return the mapped address and the size
> > > (and if you want to support snapshots > 4GB, be my guest, but I suspect
> > > you're actually *better* off just admitting that if you cannot shrink
> > > the snapshot to less than 32 bits, it's not worth doing)
>
> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > That inherently limits the image to half of available ram (you need
> > somewhere to store the snapshot), so you won't get the full image you
> > express interest in below.
>
> It doesn't. We can make the userspace mapped pages copy-on-write. As long
> as the userspace makes sure there's not much activity during
> snapshot/shutdown, we will be fine. What we probably do need to copy is
> kernel pages.

The user space is (and IMHO should be) frozen way before that and what you're
suggesting here is what I wanted to implement some time ago. The problem with
this was that the user space pages may be updated, for example, by device
drivers as a result of some deferred I/O after we've snapshotted the system.

I didn't know how to find out which pages owned by the user space could be
updated this way, so I gave up at that time.

Greetings,
Rafael
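
The copy-on-write idea debated above can be modelled in a few lines, which also makes both objections concrete. This is only a toy sketch under stated assumptions, not kernel code: struct cow_model, cow_snapshot(), cow_write() and cow_snapshot_page() are hypothetical names, and pages are tiny arrays. Each first write to a page after the snapshot consumes one spare page to preserve the old contents, so enough activity exhausts memory. The model also only sees writes that go through cow_write(); an update that bypasses it (such as the deferred device I/O mentioned above) would silently change what the snapshot sees.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NPAGES  8
#define PAGE_SZ 16

/* Toy model: "live" is what the running system sees, "saved" holds the
 * pre-write copies that keep the snapshot consistent. The spare counter
 * stands in for the free memory available for those copies. */
struct cow_model {
    unsigned char live[NPAGES][PAGE_SZ];
    unsigned char saved[NPAGES][PAGE_SZ];
    bool copied[NPAGES];   /* has page i been duplicated since the snapshot? */
    size_t spare;          /* pages still available for duplicates */
};

/* Take the snapshot: nothing is copied yet, every page is shared. */
static void cow_snapshot(struct cow_model *m, size_t spare_pages)
{
    memset(m->copied, 0, sizeof m->copied);
    m->spare = spare_pages;
}

/* Write to a page after the snapshot. The first write to each page
 * preserves the old contents. Returns 0 on success, -1 if the page index
 * is invalid or there is no spare memory left, in which case the snapshot
 * would have to be aborted or restarted. */
static int cow_write(struct cow_model *m, size_t page, unsigned char value)
{
    if (page >= NPAGES)
        return -1;
    if (!m->copied[page]) {
        if (m->spare == 0)
            return -1;
        memcpy(m->saved[page], m->live[page], PAGE_SZ);
        m->copied[page] = true;
        m->spare--;
    }
    memset(m->live[page], value, PAGE_SZ);
    return 0;
}

/* The contents the snapshot writer should see for a page. */
static const unsigned char *cow_snapshot_page(const struct cow_model *m,
                                              size_t page)
{
    assert(page < NPAGES);
    return m->copied[page] ? m->saved[page] : m->live[page];
}
```

With a budget of one spare page, a first write to page 0 succeeds and the snapshot still sees the old contents; a later write to a different page fails because nothing is left to hold its copy.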

Rafael J. Wysocki

Apr 27, 2007, 5:30:18 PM

On Friday, 27 April 2007 08:18, Pekka J Enberg wrote:
> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > COW is a possibility, but I understood (perhaps wrongly) that Linus was
> > thinking of a single syscall or such like to prepare the snapshot. If
> > you're going to start doing things like this, won't that mean you'd then
> > have to update/redo the snapshot or somehow nullify the effect of
> > anything the program does so that doing it again after the snapshot is
> > restored doesn't cause problems?
>
> No. The snapshot is just that. A snapshot in time. From kernel point of
> view, it doesn't matter one bit when you did it or if the state has
> changed before you resume. It's up to userspace to make sure the user
> doesn't do real work while the snapshot is being written to disk and
> machine is shut down.

Why do you think that keeping the user space frozen after 'snapshot' is a bad
idea? I think that solves many of the problems you're discussing.

Greetings,
Rafael

Rafael J. Wysocki

Apr 27, 2007, 5:30:22 PM

On Friday, 27 April 2007 14:49, Pavel Machek wrote:
> Hi!
>
> > > * Doing things in the right order? (Prepare the image, then do the
> > > atomic copy, then save).
> >
> > I'd actually like to discuss this a bit..
> >
> > I'm obviously not a huge fan of the whole user/kernel level split and
> > interfaces, but I actually do think that there is *one* split that makes
> > sense:
> >
> > - generate the (whole) snapshot image entirely inside the kernel
> >
> > - do nothing else (ie no IO at all), and just export it as a single image
> > to user space (literally just mapping the pages into user space).
> > *one* interface. None of the "pretty UI update" crap. Just a single
> > system call:
> >
> > void *snapshot_system(u32 *size);
> >
> > which will map in the snapshot, return the mapped address and the size
> > (and if you want to support snapshots > 4GB, be my guest, but I suspect
> > you're actually *better* off just admitting that if you cannot shrink
> > the snapshot to less than 32 bits, it's not worth doing)
>
> I think this is very similar to current uswsusp design; except that we
> are using read on /dev/snapshot to read the snapshot (not memory
> mapping) and that we freeze the system

Yes, it seems so.

> (because I do not think killall _SIGSTOP is enough).

Agreed.

Greetings,
Rafael

Nigel Cunningham

Apr 27, 2007, 5:50:10 PM

Hi.

On Fri, 2007-04-27 at 16:55 +0200, Pavel Machek wrote:
> On Fri 2007-04-27 08:41:56, Pekka Enberg wrote:
> > On 4/27/07, Pavel Machek <pa...@ucw.cz> wrote:
> > >Now, it would be _very_ nice to be able to snapshot system and
> > >continue running, but I just don't see how to do it without extensive
> > >filesystem support.
> >
> > So what kind of support do we need from the filesystem?
>
> "forcedremount ro, not telling anyone, not killing processes" would do
> the trick. FS snapshots might do.

It sounds to me more like Pekka is thinking of checkpointing support. If
that's the case, then remounting filesystems isn't going to be an
option. You want to freeze them for just long enough so that you can
determine what needs saving in the checkpoint. You certainly don't want
to make rw file handles ro and so on.

Nigel


Linus Torvalds

Apr 27, 2007, 5:50:14 PM

On Fri, 27 Apr 2007, Rafael J. Wysocki wrote:
>
> Why do you think that keeping the user space frozen after 'snapshot' is a bad
> idea? I think that solves many of the problems you're discussing.

It makes it harder to debug (wouldn't it be *nice* to just ssh in, and do

gdb -p <snapshotter>

when something goes wrong?) but we also *depend* on user space for various
things (the same way we depend on kernel threads, and why it has been such
a total disaster to try to freeze the kernel threads too!). For example,
if you want to do graphical stuff, just using X would be quite nice,
wouldn't it?

But I do agree that doing everything in the kernel is likely to just be a
hell of a lot simpler for everybody.

Linus

Nigel Cunningham

Apr 27, 2007, 6:10:08 PM

Hi.

On Fri, 2007-04-27 at 14:44 -0700, Linus Torvalds wrote:
>
> On Fri, 27 Apr 2007, Rafael J. Wysocki wrote:
> >
> > Why do you think that keeping the user space frozen after 'snapshot' is a bad
> > idea? I think that solves many of the problems you're discussing.
>
> It makes it harder to debug (wouldn't it be *nice* to just ssh in, and do
>
> gdb -p <snapshotter>

Make the machine being suspended a VM and you can already do that.

> when something goes wrong?) but we also *depend* on user space for various
> things (the same way we depend on kernel threads, and why it has been such
> a total disaster to try to freeze the kernel threads too!). For example,
> if you want to do graphical stuff, just using X would be quite nice,
> wouldn't it?

It would be nice, yes.

But in doing so you make the contents of the disk inconsistent with the
state you've just snapshotted, leading to filesystem corruption. Even if
you modify filesystems to do checkpointing (which is what we're really
talking about), you still also have the problem that your snapshot has
to be stored somewhere before you write it to disk, so you also have to
either

1) write some known static memory to disk before the snapshot and reuse
it for the snapshot,
2) ensure up to half the RAM is free for your snapshot,
3) compress the snapshot as you take it, guessing beforehand how much
memory the compressed snapshot might take and freeing that much, or
4) reserve memory at boot time for the atomic copy so that 2) or 3) is
still done, but without having to free the memory. (Yuk!)

> But I do agree that doing everything in the kernel is likely to just be a
> hell of a lot simpler for everybody.

Indeed.

Nigel


Rafael J. Wysocki

Apr 27, 2007, 6:10:12 PM

On Friday, 27 April 2007 23:44, Linus Torvalds wrote:
>
> On Fri, 27 Apr 2007, Rafael J. Wysocki wrote:
> >
> > Why do you think that keeping the user space frozen after 'snapshot' is a bad
> > idea? I think that solves many of the problems you're discussing.
>
> It makes it harder to debug (wouldn't it be *nice* to just ssh in, and do
>
> gdb -p <snapshotter>
>
> when something goes wrong?) but we also *depend* on user space for various
> things (the same way we depend on kernel threads, and why it has been such
> a total disaster to try to freeze the kernel threads too!).

We're freezing many of them just fine. ;-)

> For example, if you want to do graphical stuff, just using X would be quite
> nice, wouldn't it?

Yes, it would, but as long as we can't protect mounted filesystems from being
touched, it's just dangerous to let the user space run at that point.

> But I do agree that doing everything in the kernel is likely to just be a
> hell of a lot simpler for everybody.

:-)

Greetings,
Rafael

Linus Torvalds

Apr 27, 2007, 6:20:07 PM

On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
>
> We're freezing many of them just fine. ;-)

And can you name a _single_ advantage of doing so?

It so happens, that most people wouldn't notice or care that kmirrord got
frozen (kernel thread picked at random - it might be one of the threads
that has gotten special-cased to not do that), but I have yet to hear a
single coherent explanation for why it's actually a good idea in the first
place.

And it has added totally idiotic code to every single kernel thread main
loop. For _no_ reason, except that the concept was broken, and needed more
breakage to just make it work.

Linus

Rafael J. Wysocki

Apr 27, 2007, 6:40:15 PM

On Saturday, 28 April 2007 00:08, Linus Torvalds wrote:
>
> On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> >
> > We're freezing many of them just fine. ;-)
>
> And can you name a _single_ advantage of doing so?

Yes. We have a lot less interdependencies to worry about during the whole
operation.

> It so happens, that most people wouldn't notice or care that kmirrord got
> frozen (kernel thread picked at random - it might be one of the threads
> that has gotten special-cased to not do that), but I have yet to hear a
> single coherent explanation for why it's actually a good idea in the first
> place.

Well, I don't know if that's a 'coherent' explanation from your point of view
(probably not), but I'll try nevertheless:
1) if the kernel threads are frozen, we know that they don't hold any locks
that could interfere with the freezing of device drivers,
2) if they are frozen, we know, for example, that they won't call user mode
helpers or do similar things,
3) if they are frozen, we know that they won't submit I/O to disks and
potentially damage filesystems (suspend2 has much more problems with that
than swsusp, but still. And yes, there have been bug reports related to it,
so it's not just my fantasy).

Probably some other people can say more about it.

> And it has added totally idiotic code to every single kernel thread main
> loop. For _no_ reason, except that the concept was broken, and needed more
> breakage to just make it work.

It is actually useful for some things other than the hibernation/suspend, the
code is not idiotic (it's one line of code in the majority of cases) and you
should take that "I hate everything even remotely related to hibernation" hat
off, really.

Greetings,
Rafael

David Lang

Apr 27, 2007, 6:50:08 PM

On Fri, 27 Apr 2007, Rafael J. Wysocki wrote:

> On Friday, 27 April 2007 14:49, Pavel Machek wrote:
>>
>> I think this is very similar to current uswsusp design; except that we
>> are using read on /dev/snapshot to read the snapshot (not memory
>> mapping) and that we freeze the system
>
> Yes, it seems so.
>
>> (because I do not think killall _SIGSTOP is enough).
>

Remember, this is being done inside the kernel. The kernel can do things like
saving off the scheduler queue to prevent any userspace from running during the
snapshot. It could then move selected pids over to a new queue to selectively
'unfreeze' whatever you need (like the X processes, for example) and then proceed
normally, allowing processes to be spawned, forked, etc. without activating the
rest of userspace, because the rest just won't be available to be scheduled.
Userspace can tell the kernel the list of pids to unfreeze, so the kernel doesn't
need to try and guess.
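
The selective-unfreeze idea above can be illustrated with a small model. This is a sketch under loud assumptions, not how the kernel scheduler or freezer is implemented: struct sched_model and the functions freeze_userspace(), thaw_pids(), is_runnable() and runnable_count() are all hypothetical names invented here. It only shows the bookkeeping: park every userspace task, then let userspace name the pids that should run again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TASKS 16

/* Toy task descriptor: "user" marks userspace tasks, "frozen" marks
 * tasks that the scheduler model refuses to run. */
struct task {
    int pid;
    bool user;
    bool frozen;
};

struct sched_model {
    struct task tasks[MAX_TASKS];
    size_t ntasks;
};

/* Park every userspace task; kernel threads are left alone. */
static void freeze_userspace(struct sched_model *s)
{
    assert(s->ntasks <= MAX_TASKS);
    for (size_t i = 0; i < s->ntasks; i++)
        if (s->tasks[i].user)
            s->tasks[i].frozen = true;
}

/* Thaw exactly the pids userspace asked for (for example the X server).
 * Unknown pids are ignored. Returns how many tasks were thawed. */
static size_t thaw_pids(struct sched_model *s, const int *pids, size_t npids)
{
    size_t thawed = 0;

    for (size_t i = 0; i < s->ntasks; i++)
        for (size_t j = 0; j < npids; j++)
            if (s->tasks[i].pid == pids[j] && s->tasks[i].frozen) {
                s->tasks[i].frozen = false;
                thawed++;
            }
    return thawed;
}

/* True if the pid exists and is not frozen. */
static bool is_runnable(const struct sched_model *s, int pid)
{
    for (size_t i = 0; i < s->ntasks; i++)
        if (s->tasks[i].pid == pid)
            return !s->tasks[i].frozen;
    return false;
}

/* Number of tasks the scheduler model would still run. */
static size_t runnable_count(const struct sched_model *s)
{
    size_t n = 0;

    for (size_t i = 0; i < s->ntasks; i++)
        if (!s->tasks[i].frozen)
            n++;
    return n;
}
```

The model leaves out the hard part raised elsewhere in the thread: a thawed task can still touch mounted filesystems, so this bookkeeping alone does not make it safe to let it run.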

David Lang

Apr 27, 2007, 7:10:09 PM

On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:

>>> We're freezing many of them just fine. ;-)
>>
>> And can you name a _single_ advantage of doing so?
>
> Yes. We have a lot less interdependencies to worry about during the whole
> operation.
>
>> It so happens, that most people wouldn't notice or care that kmirrord got
>> frozen (kernel thread picked at random - it might be one of the threads
>> that has gotten special-cased to not do that), but I have yet to hear a
>> single coherent explanation for why it's actually a good idea in the first
>> place.
>
> Well, I don't know if that's a 'coherent' explanation from your point of view
> (probably not), but I'll try nevertheless:
> 1) if the kernel threads are frozen, we know that they don't hold any locks
> that could interfere with the freezing of device drivers,

Does the process of freezing really wait until all locks have been released?

> 2) if they are frozen, we know, for example, that they won't call user mode
> helpers or do similar things,

this won't matter unless the user mode helpers are going to do I/O or other
permanent changes

> 3) if they are frozen, we know that they won't submit I/O to disks and
> potentially damage filesystems (suspend2 has much more problems with that
> than swsusp, but still. And yes, there have been bug reports related to it,
> so it's not just my fantasy).

if you have the filesystems checkpointed then I/O after the freeze won't matter
as you just revert to the checkpoint (and since this is going to be thrown away
it can stay in ram)

if we are willing to make a break with the past to implement the new snapshot
capability, we should be able to use the LVM snapshot code to handle the
filesystem

David Lang

Linus Torvalds

Apr 27, 2007, 7:20:06 PM

On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
>
> > And can you name a _single_ advantage of doing so?
>
> Yes. We have a lot less interdependencies to worry about during the whole
> operation.

That's not an advantage. That's why it has *sucked*.

Trying to freeze kernel threads has _caused_ problems. It has _added_
these interdependencies. It hasn't removed a single dependency at any
time, it has just added new problems!

> 1) if the kernel threads are frozen, we know that they don't hold any locks
> that could interfere with the freezing of device drivers,
> 2) if they are frozen, we know, for example, that they won't call user mode
> helpers or do similar things,
> 3) if they are frozen, we know that they won't submit I/O to disks and
> potentially damage filesystems (suspend2 has much more problems with that
> than swsusp, but still. And yes, there have been bug reports related to it,
> so it's not just my fantasy).

NONE of these are valid explanations at all. You're listing totally
theoretical problems, and ignoring all the _real_ problems that trying to
freeze kernel threads has _caused_.

If you want to control user-mode helpers, you do that - you do not freeze
kernel threads!

And no, kernel threads do not submit IO to disks on their own. You just
made that up. Yes, they can be involved in that whole disk submission
thing, but in a good way - they can be required in order to make disk
writing work!

The problem that suspend has had is that it's done everything totally the
wrong way around. Do kernel threads do disk IO? Sure, if asked to do so.
For example, kernel threads can be involved in md etc, but that's a *good*
thing. The way to shut them up is not to freeze the threads, but to freeze
the *disk*.

Linus

Nigel Cunningham

Apr 27, 2007, 7:20:07 PM

Hi.

Just to let you know - I'm not ignoring your message. It's just taking
some time to think through the issues and try to formulate a good reply.
Oh, and of course there are a gazillion other messages flying about at
the moment that need attention too.

Regards,

Nigel


Rafael J. Wysocki

Apr 27, 2007, 7:20:08 PM

On Saturday, 28 April 2007 00:26, David Lang wrote:
> On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
>
> >>> We're freezing many of them just fine. ;-)
> >>
> >> And can you name a _single_ advantage of doing so?
> >
> > Yes. We have a lot less interdependencies to worry about during the whole
> > operation.
> >
> >> It so happens, that most people wouldn't notice or care that kmirrord got
> >> frozen (kernel thread picked at random - it might be one of the threads
> >> that has gotten special-cased to not do that), but I have yet to hear a
> >> single coherent explanation for why it's actually a good idea in the first
> >> place.
> >
> > Well, I don't know if that's a 'coherent' explanation from your point of view
> > (probably not), but I'll try nevertheless:
> > 1) if the kernel threads are frozen, we know that they don't hold any locks
> > that could interfere with the freezing of device drivers,
>
> does the process of freezing really wait until all locks have been released?

Yes, it does.

> > 2) if they are frozen, we know, for example, that they won't call user mode
> > helpers or do similar things,
>
> this won't matter unless the user mode helpers are going to do I/O or other
> permanent changes

Please note that even accessing a file may be a permanent change.

> > 3) if they are frozen, we know that they won't submit I/O to disks and
> > potentially damage filesystems (suspend2 has much more problems with that
> > than swsusp, but still. And yes, there have been bug reports related to it,
> > so it's not just my fantasy).
>
> if you have the filesystems checkpointed then I/O after the freeze won't matter
> as you just revert to the checkpoint (and since this is going to be thrown away
> it can stay in ram)

In that case, I would agree. Currently, however, we're not even close to this
point.

The checkpointing of filesystems would be a very welcome feature, but there's
no one working on it right now, AFAICT.

> if we are willing to make a break with the past to implement the new snapshot
> capability, we should be able to use the LVM snapshot code to handle the
> filesystem

Yes, we can do that, in principle, and screw all of the current users in the
process. And finally we'd end up with something similar to what is done now,
IMHO.

And no, the things are not just totally broken, as it may follow from these
discussions. The problem is that the people who are discussing them so
viciously have never tried to write anything like the hibernation code.

This is as though I were discussing the design of the CPU schedulers,
although I only know how they work on a general level.

Actually, the really problematic thing with the hibernation _right_ _now_ is
what Linus is so concerned about (and rightfully so) - that we use the
same device drivers' callbacks for the hibernation and suspend (aka s2ram).
The other things work quite well and are really robust.

David Lang

Apr 27, 2007, 7:40:06 PM

On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:

> On Saturday, 28 April 2007 00:26, David Lang wrote:
>> On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
>>
>>>>> We're freezing many of them just fine. ;-)
>>>>
>>>> And can you name a _single_ advantage of doing so?
>>>
>>> Yes. We have a lot less interdependencies to worry about during the whole
>>> operation.
>>>
>>>> It so happens, that most people wouldn't notice or care that kmirrord got
>>>> frozen (kernel thread picked at random - it might be one of the threads
>>>> that has gotten special-cased to not do that), but I have yet to hear a
>>>> single coherent explanation for why it's actually a good idea in the first
>>>> place.
>>>
>>> Well, I don't know if that's a 'coherent' explanation from your point of view
>>> (probably not), but I'll try nevertheless:
>>> 1) if the kernel threads are frozen, we know that they don't hold any locks
>>> that could interfere with the freezing of device drivers,
>>
>> does the process of freezing really wait until all locks have been released?
>
> Yes, it does.
>
>>> 2) if they are frozen, we know, for example, that they won't call user mode
>>> helpers or do similar things,
>>
>> this won't matter unless the user mode helpers are going to do I/O or other
>> permanent changes
>
> Please note that even accessing a file may be a permanent change.

if accessing a file on a read-only filesystem changes that filesystem it's a bug

see the recent thread about ext3 journal replays when mounting read-only as an
example.

>>> 3) if they are frozen, we know that they won't submit I/O to disks and
>>> potentially damage filesystems (suspend2 has much more problems with that
>>> than swsusp, but still. And yes, there have been bug reports related to it,
>>> so it's not just my fantasy).
>>
>> if you have the filesystems checkpointed then I/O after the freeze won't matter
>> as you just revert to the checkpoint (and since this is going to be thrown away
>> it can stay in ram)
>
> In that case, I would agree. Currently, however, we're not even close to this
> point.
>
> The checkpointing of filesystems would be a very welcome feature, but there's
> no one working on it right now, AFAICT.
>
>> if we are willing to make a break with the past to implement the new snapshot
>> capability, we should be able to use the LVM snapshot code to handle the
>> filesystem
>
> Yes, we can do that, in principle, and screw all of the current users in the
> process. And finally we'd end up with something similar to what is done now,
> IMHO.

however, the result may be a lot less 'special case power management' code and a
lot more re-use of code that's in place for other uses.

if work on the current versions was stopped (other than trying to avoid
regressions) and a new version (with new userspace tools) was built in a way
that satisfies everyone the old version could be phased out in a year or two
(per the normal feature removal process)

> And no, the things are not just totally broken, as it may follow from these
> discussions. The problem is that the people who are discussing them so
> viciously have never tried to write anything like the hibernation code.
>
> This is as though I were discussing the design of the CPU schedulers,
> although I only know how they work on a general level.
>
> Actually, the really problematic thing with the hibernation _right_ _now_ is
> what Linus is so concerned about (and rightfully so) - that we use the
> same device drivers' callbacks for the hibernation and suspend (aka s2ram).
> The other things work quite well and are really robust.

if simply splitting the functions cleans everything up enough to satisfy
everyone then we're almost done right? ;-)

however I think that there are other fundamental disagreements here, and neither
the 'do absolutely everything in the kernel' nor the 'do almost nothing in the
kernel' approaches are going to fly in the long run. I think the
userspace<->kernel interface is going to be different than either approach is
doing now, and as such it's an opportunity to make more drastic changes if they
are appropriate.

for example, why should we have LVM snapshot code and hibernate
snapshot/filesystem checkpoint code instead of just using the LVM code (which
gets exercised and tested far more than the other code ever would be)? saying
that if you want to suspend to disk you need to use LVM is a change, but it's
a change that people could probably live with.

David Lang

Rafael J. Wysocki

Apr 27, 2007, 7:50:07 PM

On Saturday, 28 April 2007 01:17, Linus Torvalds wrote:
>
> On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> >
> > > And can you name a _single_ advantage of doing so?
> >
> > Yes. We have a lot less interdependencies to worry about during the whole
> > operation.
>
> That's not an advantage. That's why it has *sucked*.

Actually, the less things happen while we're creating and saving the image,
the less sources of potential problems there are and by freezing the kernel
threads (not all of them), we cause less things to happen at that time.

To make you happy, we could stop doing that, but what actual _advantage_
would that bring?

> Trying to freeze kernel threads has _caused_ problems. It has _added_
> these interdependencies. It hasn't removed a single dependency at any
> time, it has just added new problems!

What problems are you talking about?

> > 1) if the kernel threads are frozen, we know that they don't hold any locks
> > that could interfere with the freezing of device drivers,
> > 2) if they are frozen, we know, for example, that they won't call user mode
> > helpers or do similar things,
> > 3) if they are frozen, we know that they won't submit I/O to disks and
> > potentially damage filesystems (suspend2 has much more problems with that
> > than swsusp, but still. And yes, there have been bug reports related to it,
> > so it's not just my fantasy).
>
> NONE of these are valid explanations at all. You're listing totally
> theoretical problems, and ignoring all the _real_ problems that trying to
> freeze kernel threads has _caused_.

Example, please?

> If you want to control user-mode helpers, you do that - you do not freeze
> kernel threads!
>
> And no, kernel threads do not submit IO to disks on their own. You just
> made that up.

No, I didn't. Nigel can confirm, I think.

> Yes, they can be involved in that whole disk submission thing, but in a good
> way - they can be required in order to make disk writing work!

Some of them can be, some others need not be. We don't need any fs-related
kernel threads for saving the image, for example.

> The problem that suspend has had is that it's done everything totally the
> wrong way around. Do kernel threads do disk IO? Sure, if asked to do so.

They can be asked before we do the snapshot and complete the operation
afterwards, no?

> For example, kernel threads can be involved in md etc, but that's a *good*
> thing.

We don't freeze these threads.

> The way to shut them up is not to freeze the threads, but to freeze the *disk*.

In principle, you're right. In practice, go and try it.

Anyway, why is it so important that _all_ of the kernel threads be running
while the snapshot is created and saved?

Rafael J. Wysocki

Apr 27, 2007, 8:00:11 PM

Oh well. Is this really wrong to protect users from such bugs, if we can do
that?

> >>> 3) if they are frozen, we know that they won't submit I/O to disks and
> >>> potentially damage filesystems (suspend2 has much more problems with that
> >>> than swsusp, but still. And yes, there have been bug reports related to it,
> >>> so it's not just my fantasy).
> >>
> >> if you have the filesystems checkpointed then I/O after the freeze won't matter
> >> as you just revert to the checkpoint (and since this is going to be thrown away
> >> it can stay in ram)
> >
> > In that case, I would agree. Currently, however, we're not even close to this
> > point.
> >
> > The checkpointing of filesystems would be a very welcome feature, but there's
> > no one working on it right now, AFAICT.
> >
> >> if we are willing to make a break with the past to implement the new snapshot
> >> capability, we should be able to use the LVM snapshot code to handle the
> >> filesystem
> >
> > Yes, we can do that, in principle, and screw all of the current users in the
> > process. And finally we'd end up with something similar to what is done now,
> > IMHO.
>
> however, the result may be a lot less 'special case power management' code and a

Are you referring to some specific code?

> lot more re-use of code that's in place for other uses.

This already is happening.

> if work on the current versions was stopped (other than trying to avoid
> regressions) and a new version (with new userspace tools) was built in a way
> that satisfies everyone the old version could be phased out in a year or two
> (per the normal feature removal process)

May I say it's not realistic?

> > And no, the things are not just totally broken, as it may follow from these
> > discussions. The problem is that the people who are discussing them so
> > viciously have never tried to write anything like the hibernation code.
> >
> > This is as though I were discussing the design of the CPU schedulers,
> > although I only know how they work on a general level.
> >
> > Actually, the really problematic thing with the hibernation _right_ _now_ is
> > what Linus is so concerned about (and rightfully so) - that we use the
> > same device drivers' callbacks for the hibernation and suspend (aka s2ram).
> > The other things work quite well and are really robust.
>
> if simply splitting the functions cleans everything up enough to satisfy
> everyone then we're almost done right? ;-)

Practically, yes. Theoretically, there's no software you can't improve
(except, probably, TeX), but that might not be worth the effort.

> however I think that there are other fundamental disagreements here, and neither
> the 'do absolutely everything in the kernel' nor the 'do almost nothing in the
> kernel' approaches are going to fly in the long run.

I think we'll have an agreement, though.

> I think the userspace<->kernel interface is going to be different than
> either approach is doing now,

You're probably right

> and as such it's an opportunity to make more drastic changes if they are
> appropriate.

Well, maybe.

> for example, why should we have LVM snapshot code and hibernate
> snapshot/filesystem checkpoint code instead of just using the LVM code (which
> gets exercised and tested far more than the other code ever would be)? saying
> that if you want to suspend to disk you need to use LVM is a change, but it's
> a change that people could probably live with.

Well, that's a theory. Probably a good one, but still. :-)

The positive aspect of all this is that people have started to pay attention to
what we're doing, and gradually they will learn about the problems that they're
just not seeing right now.

Greetings,
Rafael

Nigel Cunningham

Apr 27, 2007, 8:00:11 PM

Hi.

On Sat, 2007-04-28 at 01:45 +0200, Rafael J. Wysocki wrote:
> On Saturday, 28 April 2007 01:17, Linus Torvalds wrote:
> >
> > On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> > >
> > > > And can you name a _single_ advantage of doing so?
> > >
> > > Yes. We have a lot less interdependencies to worry about during the whole
> > > operation.
> >
> > That's not an advantage. That's why it has *sucked*.
>
> Actually, the less things happen while we're creating and saving the image,
> the less sources of potential problems there are and by freezing the kernel
> threads (not all of them), we cause less things to happen at that time.
>
> To make you happy, we could stop doing that, but what actual _advantage_
> would that bring?

A couple of other advantages to freezing other processes:

1) It makes predicting how much memory is available for making and
saving the snapshot a tractable problem. It therefore makes hibernation
_much_ more reliable.
2) Racing against other processes would also make hibernation slower,
increasing the chances of your battery running out before the save is
complete.
3) It makes finding potential memory leaks in the code possible. It was
ages ago now, but at one stage I could display a table saying exactly
how many pages had been allocated and freed by different sections of the
process and compare the number of free pages at the start and end of the
cycle to ensure there were no memory leaks at all.
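
The bookkeeping described in point 3 can be illustrated with a toy ledger. This is only a sketch in Python, not Suspend2 code: the class name `PageLedger`, the section labels and the page counts are all made up for illustration. It also leans on the assumption the point itself makes, namely that nothing else allocates or frees pages during the cycle, which is what freezing other processes is meant to guarantee.

```python
from collections import defaultdict


class PageLedger:
    """Toy per-section tally of page allocations and frees (hypothetical)."""

    def __init__(self, free_pages):
        self.start_free = free_pages        # free pages at the start of the cycle
        self.free_pages = free_pages        # free pages right now
        self.allocated = defaultdict(int)   # section -> pages allocated so far
        self.freed = defaultdict(int)       # section -> pages freed so far

    def alloc(self, section, pages):
        if pages > self.free_pages:
            raise MemoryError(
                f"{section}: wanted {pages} pages, only {self.free_pages} free")
        self.free_pages -= pages
        self.allocated[section] += pages

    def free(self, section, pages):
        if self.freed[section] + pages > self.allocated[section]:
            raise ValueError(f"{section}: freeing more pages than it allocated")
        self.free_pages += pages
        self.freed[section] += pages

    def leaks(self):
        """Return {section: pages still outstanding} for unbalanced sections."""
        return {section: self.allocated[section] - self.freed[section]
                for section in self.allocated
                if self.allocated[section] != self.freed[section]}

    def balanced(self):
        """True if the free-page count is back where the cycle started."""
        return self.free_pages == self.start_free
```

Comparing `leaks()` and `balanced()` at the end of a cycle only means something if the ledger sees every allocation. With other tasks running, the free-page count drifts for reasons the ledger cannot attribute to any section.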

> > Trying to freeze kernel threads has _caused_ problems. It has _added_
> > these interdependencies. It hasn't removed a single dependency at any
> > time, it has just added new problems!
>
> What problems are you talking about?
>
> > > 1) if the kernel threads are frozen, we know that they don't hold any locks
> > > that could interfere with the freezing of device drivers,
> > > 2) if they are frozen, we know, for example, that they won't call user mode
> > > helpers or do similar things,
> > > 3) if they are frozen, we know that they won't submit I/O to disks and
> > > potentially damage filesystems (suspend2 has much more problems with that
> > > than swsusp, but still. And yes, there have been bug reports related to it,
> > > so it's not just my fantasy).
> >
> > NONE of these are valid explanations at all. You're listing totally
> > theoretical problems, and ignoring all the _real_ problems that trying to
> > freeze kernel threads has _caused_.
>
> Example, please?

I agree with Rafael. Freezing processes greatly helps in ensuring we
have a consistent image. He's right, too, in asserting that it's even
more important for Suspend2. Freezing processes is essential to being
able to know that those LRU pages won't change and therefore being able
to save them separately and then reuse them for the atomic copy.
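
A toy model may make the "save them separately and then reuse them" idea concrete. It is a sketch under loud assumptions: memory is modelled as a Python dict from frame number to contents, the function name `two_stage_save` is hypothetical, and nothing here is taken from the Suspend2 source. The only thing it tries to show is why the LRU pages must not change between the two stages.

```python
def two_stage_save(memory, lru_frames):
    """Return an image {frame: contents} of `memory` as it was on entry.

    memory     -- dict mapping frame number to page contents (toy model)
    lru_frames -- set of frames holding LRU pages, assumed not to change
                  while this runs (the job the freezer is doing here)
    """
    image = {}

    # Stage 1: write the LRU pages straight to storage, no copy needed.
    for frame in sorted(lru_frames):
        image[frame] = memory[frame]

    # Stage 2: everything else gets an atomic copy. The copy is placed in
    # frames whose contents were already saved in stage 1.
    rest = sorted(f for f in memory if f not in lru_frames)
    if len(rest) > len(lru_frames):
        raise MemoryError("not enough saved LRU frames to hold the atomic copy")
    scratch = sorted(lru_frames)[:len(rest)]
    for src, dst in zip(rest, scratch):
        memory[dst] = memory[src]   # overwrites an already-saved LRU page

    # The copy can then be written out at leisure.
    for src, dst in zip(rest, scratch):
        image[src] = memory[dst]
    return image
```

If something modified an LRU page after stage 1 but before the atomic copy, the image would hold stale contents for that frame, which is the inconsistency the paragraph above is worried about.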

> > If you want to control user-mode helpers, you do that - you do not freeze
> > kernel threads!
> >
> > And no, kernel threads do not submit IO to disks on their own. You just
> > made that up.
>
> No, I didn't. Nigel can confirm, I think.

I have had problems with MD threads generating I/O that I couldn't
account for - after userspace had been frozen, filesystems had been
nicely synced and so on. I have to speak with reservations though,
because I haven't yet gotten to the bottom of where the I/O is coming
from... too many things, too small time slices.

> > Yes, they can be involved in that whole disk submission thing, but in a good
> > way - they can be required in order to make disk writing work!
>
> Some of them can be, some others need not be. We don't need any fs-related
> kernel threads for saving the image, for example.

Yeah, so long as we bmap the storage we want to use beforehand (thinking
of swap files and ordinary files).

> > The problem that suspend has had is that it's done everything totally the
> > wrong way around. Do kernel threads do disk IO? Sure, if asked to do so.
>
> They can be asked before we do the snapshot and complete the operation
> afterwards, no?
>
> > For example, kernel threads can be involved in md etc, but that's a *good*
> > thing.
>
> We don't freeze these threads.
>
> > The way to shut them up is not to freeze the threads, but to freeze the *disk*.
>
> In principle, you're right. In practice, go and try it.

I have to disagree here. Freezing the disk instead of the threads is
dealing with the symptoms instead of the cause.

Regards,

Nigel

Linus Torvalds

Apr 27, 2007, 8:10:06 PM

On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
>
> Actually, the less things happen while we're creating and saving the image,
> the less sources of potential problems there are and by freezing the kernel
> threads (not all of them), we cause less things to happen at that time.

That makes no sense.

You have to create the snapshot image with interrupts disabled *anyway*.

I really don't see how you can say that stopping threads etc can make any
difference what-so-ever. If you don't create the snapshot with interrupts
disabled (and just with a single CPU running) you have so many other
problems that it's not even remotely funny.

So there's *by*definition* nothing at all that can happen while you
snapshot the system. Claiming otherwise is just silly.

> To make you happy, we could stop doing that, but what actual _advantage_
> would that bring?

Like getting rid of all the magic "I don't want you to freeze me" crud?

Or getting rid of this horribly idiotic "three times widdershins" kind of
black magic mentality! It looks like the main reason for the process
freezing has nothing to do with technology, but some irrational fear of
other things happening at the same time, even though they CANNOT happen if
you do things even half-way sanely.

The "let's stop all kernel threads" is superstition. It's the same kind of
superstition that made people write "sync" three times before turning off
the power in the olden times. It's the kind of superstition that comes
from "we don't do things right, so let's be vewy vewy quiet and _pray_
that it works when we are being quiet".

That's bad.

It's doubly bad, because that idiocy has also infected s2ram. Again,
another thing that really makes no sense at all - and we do it not just
for snapshotting, but for s2ram too. Can you tell me *why*?

> > Trying to freeze kernel threads has _caused_ problems. It has _added_
> > these interdependencies. It hasn't removed a single dependency at any
> > time, it has just added new problems!
>
> What problems are you talking about?

Like you wouldn't know. Look at commit b43376927a that you yourself are
credited with, just a month ago.

Then, do something as simple as

git grep create_freezeable_workthread

and ponder the end results of that grep. If you don't see something wrong,
you're blind.

> > NONE of these are valid explanations at all. You're listing totally
> > theoretical problems, and ignoring all the _real_ problems that trying to
> > freeze kernel threads has _caused_.
>
> Example, please?

Who do you think you are kidding? See above.

And if you think that's an isolated example, look again. And start
grepping for PF_NOFREEZE, and other examples.

The fact is, there is not a *single* reason to freeze kernel threads. But
some rocket scientist decided to, and then screwed everybody else over.

Linus

Jeremy Fitzhardinge

Apr 27, 2007, 8:20:06 PM

Linus Torvalds wrote:
> On Fri, 27 Apr 2007, Rafael J. Wysocki wrote:
>
>> Why do you think that keeping the user space frozen after 'snapshot' is a bad
>> idea? I think that solves many of the problems you're discussing.
>>
>
> It makes it harder to debug (wouldn't it be *nice* to just ssh in, and do
>
> gdb -p <snapshotter>
>
> when something goes wrong?)

Yeah, or gdb vmlinux snapshot

Then you could use kexec for resume...

J

David Lang

Apr 27, 2007, 8:30:07 PM

nobody is suggesting that you leave processes running while you do the snapshot,
what is being proposed is

1. pause userspace (prevent scheduling)
2. make snapshot image of memory
3. make mounted filesystems read-only (possibly with snapshot/checkpoint)
4. unpause
5. save image (with full userspace available, including network)
6. shutdown system (throw away all userspace memory, no need to do graceful
shutdown or nice kill signals, revert filesystem to snapshot/checkpoint if
needed)
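
The six steps above can be written down as a small ordering check. This is a toy sketch, not a proposal for kernel code: the step names and the `HibernateSequence` class are invented here, and the two flags only track what the list says about userspace and filesystems.

```python
STEPS = [
    "pause_userspace",
    "snapshot_memory",
    "make_filesystems_readonly",
    "unpause_userspace",
    "save_image",
    "shutdown",
]


class HibernateSequence:
    """Refuses to run the proposed steps in any order but the listed one."""

    def __init__(self):
        self.done = []
        self.userspace_running = True
        self.filesystems_writable = True

    def run(self, step):
        expected = STEPS[len(self.done)] if len(self.done) < len(STEPS) else None
        if step != expected:
            raise RuntimeError(f"expected {expected!r}, got {step!r}")
        if step == "pause_userspace":
            self.userspace_running = False
        elif step == "make_filesystems_readonly":
            self.filesystems_writable = False
        elif step == "unpause_userspace":
            self.userspace_running = True
        self.done.append(step)
        return self
```

The point of the ordering is visible in the flags: by the time the image is being saved, userspace is running again but can no longer make permanent changes to the filesystems.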

>>> NONE of these are valid explanations at all. You're listing totally
>>> theoretical problems, and ignoring all the _real_ problems that trying to
>>> freeze kernel threads has _caused_.
>>
>> Example, please?
>
> I agree with Rafael. Freezing processes greatly helps in ensuring we
> have a consistent image. He's right, too, in asserting that it's even
> more important for Suspend2. Freezing processes is essential to being
> able to know that those LRU pages won't change and therefore being able
> to save them separately and then reuse them for the atomic copy.

all that's needed for the snapshot is to prevent userspace from scheduling, and
prevent media from being written to in a permanent way (writing to a LVM volume
after invoking a snapshot doesn't count, just revert to the snapshot)

David Lang

Linus Torvalds

Apr 27, 2007, 8:30:10 PM

On Fri, 27 Apr 2007, Linus Torvalds wrote:
>
> The "let's stop all kernel threads" is superstition. It's the same kind of
> superstition that made people write "sync" three times before turning off
> the power in the olden times. It's the kind of superstition that comes
> from "we don't do things right, so let's be vewy vewy quiet and _pray_
> that it works when we are being quiet".

Side note: while I think things should probably *work* even with user
processes going full bore while a snapshot is taken, I'll freely admit
that I'll follow that superstition far enough that I think it's probably a
good idea to try to quiesce the system to _some_ degree, and that stopping
user programs is a good idea. Partly because the whole memory shrinking
thing, and partly just because we should do the snapshot with hw IO queues
empty.

But I don't think it would necessarily be wrong (and in many ways it would
probably be *right*) to do that IO queue stopping at the queue level
rather than at a process level. Why stop processes just because you want
to clean out IO queues? They are two totally different things!
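
As a rough illustration of stopping IO at the queue level rather than the process level, here is a toy request queue in Python. Everything in it is hypothetical (the class `ToyRequestQueue`, its method names, the string "requests"), and it makes no claim about how the block layer does or should implement this. It only shows the shape of the idea: submitters keep running, and the queue itself holds back new work once frozen.

```python
class ToyRequestQueue:
    """Toy model of a request queue that can be frozen independently of tasks."""

    def __init__(self):
        self.frozen = False
        self.in_flight = []   # issued to the (imaginary) device, not yet done
        self.held = []        # accepted while frozen, not yet issued
        self.completed = []

    def submit(self, request):
        # Callers are never blocked or frozen; only the request is parked.
        if self.frozen:
            self.held.append(request)
        else:
            self.in_flight.append(request)

    def complete_all(self):
        self.completed.extend(self.in_flight)
        self.in_flight.clear()

    def freeze(self):
        self.frozen = True
        self.complete_all()   # drain what the device already has

    def thaw(self):
        self.frozen = False
        self.in_flight.extend(self.held)
        self.held.clear()

    def quiescent(self):
        """True when frozen with nothing outstanding on the device."""
        return self.frozen and not self.in_flight
```

In this model a snapshot would be taken while `quiescent()` holds. Requests submitted in the meantime sit in `held` and are issued on `thaw()`, without any thread having been stopped.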

Linus Torvalds

Apr 27, 2007, 8:50:06 PM

On Fri, 27 Apr 2007, David Lang wrote:
>
> all that's needed for the snapshot is to prevent userspace from scheduling,

Strictly speaking, all you *really* want to make sure is not so much that
user-space isn't scheduling, as the fact that all device IO buffers must
be empty.

We can trivially snapshot an active user-space, and in fact it would
probably be hard to do a snapshot in a way that it could even *know* or
care about whether there are user-space processes running at the time of
the snapshot.

So that's not the real problem.

What we obviously *cannot* snapshot is if some particular device is in the
middle of being written to or read from, and has outstanding commands on
the device itself (as opposed to just queued to the driver). So what we do
want to make sure happens is that there are no IO queues that are active.

And the best way to make sure that there are no IO queues active is to
make sure that there are no new read or write-requests. And *that* you can
do two ways:

- actually intercepting the read/write requests. Probably not too hard,
we could literally do it in the IO scheduler (and probably much more
easily than doing it in the process scheduler), but the easy cases will
only cover the block device layer, and character devices don't have the
same kind of scheduler you can trap IO in.

- we also don't want to generate new data that needs to be snapshotted,
so we want to trap people who write even just to the page cache and
turn pages dirty. Again, we could probably do it at *that* point (ie
trapping them when they try to dirty a page), and it would be more
logical, but again, there are other cases of people who generate more
data (just any memory allocation obviously is a special case of
generating more data to be snapshotted),

so I do agree that we want to stop producing new data to be snapshotted,
and we want to stop producing new read-requests. But kernel threads really
do neither: in an idle system, kernel threads are idle too. A kernel
thread is not like a user program that actually generates data - they only
tend to act on behalf of other processes' needs.

So I think that what snapshotting really *wants* to stop is not scheduling
per se, but IO. And stopping user processes (as opposed to kernel threads)
is probably a good way to get there.

In fact, I'd argue that you want to stop user space and then encourage
some kernel threads to *start* running, notably things like bdflush should
probably be kicked to clean up some dirty stuff as part of the "shrink
data to be snapshotted" part. Trying to free memory will do that on its
own, of course.

Linus

Paul Mackerras

Apr 27, 2007, 9:00:11 PM

Linus Torvalds writes:

> I really don't see how you can say that stopping threads etc can make any
> difference what-so-ever. If you don't create the snapshot with interrupts
> disabled (and just with a single CPU running) you have so many other
> problems that it's not even remotely funny.

I agree. I don't like the freezer. We have had working
kernel-controlled suspend to RAM on powerbooks for almost 10 years
now, and we never needed to freeze processes.

That said, I can see two attractions in freezing processes:

1. It provides a way to stop new I/O requests coming in, and thus
somewhat makes up for the lack of a way to freeze device request
queues (at least, we didn't have one last time I looked).

2. Systems do sometimes die while suspended (e.g. run out of battery,
or the resume process fails), and to make the next boot painless,
you want the filesystems on disk to be as clean as possible.
Freezing processes and then doing a sync provides one way to
achieve that. Of course, you have to make sure you don't freeze
any kernel threads that are needed for doing the sync... And if
one of your filesystems is using FUSE, it's not going to get very
far.

Paul.

Rafael J. Wysocki

Apr 27, 2007, 9:00:12 PM

On Saturday, 28 April 2007 01:59, Linus Torvalds wrote:
>
> On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> >
> > Actually, the less things happen while we're creating and saving the image,
> > the less sources of potential problems there are and by freezing the kernel
> > threads (not all of them), we cause less things to happen at that time.
>
> That makes no sense.
>
> You have to create the snapshot image with interrupts disabled *anyway*.
>
> I really don't see how you can say that stopping threads etc can make any
> difference what-so-ever. If you don't create the snapshot with interrupts
> disabled (and just with a single CPU running) you have so many other
> problems that it's not even remotely funny.
>
> So there's *by*definition* nothing at all that can happen while you
> snapshot the system. Claiming otherwise is just silly.

For creating the snapshot alone, it doesn't matter. Except that the restore is
a bit cleaner (we know exactly what all of these threads will be doing when
we restore the image and enable the IRQs after that).

Still, I think that kernel threads can potentially hold locks across the
freezing of devices and image creation and that is fishy. Also I believe,
although I'm not 100% sure, that some of them may cause problems to
appear after we've created the image and while we are saving it.

> > To make you happy, we could stop doing that, but what actual _advantage_
> > would that bring?
>
> Like getting rid of all the magic "I don't want you to freeze me" crud?

And what exactly is wrong with it?

> Or getting rid of this horribly idiotic "three times widdershins" kind of
> black magic mentality! It looks like the main reason for the process
> freezing has nothing to do with technology, but some irrational fear of
> other things happening at the same time, even though they CANNOT happen if
> you do things even half-way sanely.
>
> The "let's stop all kernel threads" is superstition. It's the same kind of
> superstition that made people write "sync" three times before turning off
> the power in the olden times. It's the kind of superstition that comes
> from "we don't do things right, so let's be vewy vewy quiet and _pray_
> that it works when we are being quiet".
>
> That's bad.

Okay. Accidentally, I'm working on a freezer patch, so I'll probably drop
the freezing of kernel threads from swsusp in it and we'll see what happens.

Let's do the experiment, shall we?

> It's doubly bad, because that idiocy has also infected s2ram. Again,
> another thing that really makes no sense at all - and we do it not just
> for snapshotting, but for s2ram too. Can you tell me *why*?

Why we freeze tasks at all or why we freeze kernel threads?

> > > Trying to freeze kernel threads has _caused_ problems. It has _added_
> > > these interdependencies. It hasn't removed a single dependency at any
> > > time, it has just added new problems!
> >
> > What problems are you talking about?
>
> Like you wouldn't know. Look at commit b43376927a that you yourself are
> credited with, just a month ago.
>
> Then, do something as simple as
>
> git grep create_freezeable_workthread

s/workthread/workqueue/

> and ponder the end results of that grep. If you don't see something wrong,
> you're blind.

This was a mistake, quite unrelated to the point you're making. And actually,
I was trying to fix a problem with two kernel threads that we thought might
submit I/O to disk after the image had been created. Otherwise I wouldn't
have thought of doing that change.

> > > NONE of these are valid explanations at all. You're listing totally
> > > theoretical problems, and ignoring all the _real_ problems that trying to
> > > freeze kernel threads has _caused_.
> >
> > Example, please?
>
> Who do you think you are kidding? See above.

Well, if someone does something in a wrong way, that need not mean the
thing he was trying to do was wrong.

Somehow, I knew you would point at this ...

> And if you think that's an isolated example, look again. And start
> grepping for PF_NOFREEZE, and other examples.

May I say I'm not convinced?

> The fact is, there is not a *single* reason to freeze kernel threads. But
> some rocket scientist decided to, and then screwed everybody else over.

At least _that_ wasn't me. :-)

Greetings,
Rafael

Matthew Garrett

Apr 27, 2007, 9:10:11 PM

On Fri, Apr 27, 2007 at 05:18:16PM -0700, Jeremy Fitzhardinge wrote:

> Then you could use kexec for resume...

While that would certainly be nifty, I think we're arguably starting
from the wrong point here. Why are we booting a kernel, trying to poke
the hardware back into some sort of mock-quiescent state, freeing memory
and then (finally) overwriting the entire contents of RAM rather than
just doing all of this from the bootloader? Given the time spent in
kernel setup and unpacking initramfs nowadays, I'm willing to bet it'd
still be faster even if you're stuck using int 13 on x86.

http://apcmag.com/5873/page14 suggests that Intel is looking into this,
but I haven't heard anything more yet. To the best of my knowledge, this
is also how Windows manages things.
--
Matthew Garrett | mj...@srcf.ucam.org

Kyle Moffett

Apr 27, 2007, 9:10:12 PM

On Apr 27, 2007, at 18:07:46, Nigel Cunningham wrote:
> Hi.
>
> On Fri, 2007-04-27 at 14:44 -0700, Linus Torvalds wrote:
>> It makes it harder to debug (wouldn't it be *nice* to just ssh in,
>> and do
>> gdb -p <snapshotter>
>
> Make the machine being suspended a VM and you can already do that.

>> when something goes wrong?) but we also *depend* on user space for
>> various things (the same way we depend on kernel threads, and why
>> it has been such a total disaster to try to freeze the kernel
>> threads too!). For example, if you want to do graphical stuff,
>> just using X would be quite nice, wouldn't it?
>
> But in doing so you make the contents of the disk inconsistent with
> the state you've just snapshotted, leading to filesystem
> corruption. Even if you modify filesystems to do checkpointing
> (which is what we're really talking about), you still also have the
> problem that your snapshot has to be stored somewhere before you
> write it to disk, so you also have to either [snip]

Actually, it's a lot simpler than that. We can just combine the
device-mapper snapshot with a VM+kernel snapshot system call and be
almost done:

sys_snapshot(dev_t snapblockdev, int __user *snapshotfd);

When sys_snapshot is run, the kernel does:

1) Sequentially freeze mounted filesystems using blockdev freezing.
If it's an fs that doesn't support freezing then either fail or
force-remount-ro that fs and downgrade all its file descriptors to RO.
Doesn't need extra locking since processes which try to do IO either
succeed before the freeze call returns for that blockdev or sleep on
the unfreeze of that blockdev. Filesystems are synchronized and made
clean.
2) Iterate over the userspace process list, freezing each process
and remapping all of its pages copy-on-write. Any device-specific
pages need to have state saved by that device.
3) All processes (except kernel threads) are now frozen.
4) Kernel should save internal state corresponding to current
userspace state. The kernel also swaps out excess pages to free up
enough RAM and prepares the snapshot file-descriptor with copies of
kernel memory and the original (pre-COW) mapped userspace pages.
5) Kernel substitutes filesystems for either a device-mapper
snapshot with snapblockdev as backing storage or union with tmpfs and
remounts the underlying filesystems as read-only.
6) Kernel unfreezes all userspace processes and returns the snapshot
FD to userspace (where it can be read from).

Then userspace can do whatever it wants. Any changes to filesystems
mounted at the time of snapshot will be discarded at shutdown.
Freshly mounted filesystems won't have the union or COW thing done,
and so you can write your snapshot to a compressed encrypted file on
a USB key if you want to, you just have to unmount it before the
snapshot() syscall and remount it right afterwards.
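
As a rough illustration of the six steps above, here is a toy user-space
model in Python. It is only a sketch, not kernel code: ToySystem and all of
its methods are made up for this example, a plain dict stands in for the
device-mapper snapshot or tmpfs union, and a deep copy stands in for the
pre-COW userspace pages.

```python
import copy


class ToySystem:
    """Toy user-space model of the proposed snapshot flow (hypothetical names)."""

    def __init__(self, disk, processes):
        self.disk = dict(disk)            # base filesystem contents: path -> data
        self.processes = dict(processes)  # pid -> process memory (a dict)
        self.overlay = None               # COW layer; None until a snapshot is taken
        self.frozen = False

    def write_file(self, path, data):
        # In the proposal, I/O issued during the freeze sleeps; the model just refuses.
        if self.frozen:
            raise RuntimeError("I/O would sleep until the filesystem is unfrozen")
        # After the snapshot, writes land in the overlay, not on the base disk.
        target = self.disk if self.overlay is None else self.overlay
        target[path] = data

    def read_file(self, path):
        # Reads see the overlay first, then fall through to the base disk.
        if self.overlay is not None and path in self.overlay:
            return self.overlay[path]
        return self.disk[path]

    def snapshot(self):
        self.frozen = True                     # steps 1-3, modelled as a single flag
        image = copy.deepcopy(self.processes)  # step 4: capture pre-COW userspace state
        self.overlay = {}                      # step 5: put a COW layer over the filesystems
        self.frozen = False                    # step 6: unfreeze and hand back the image
        return image


system = ToySystem({"/etc/motd": "hello"}, {1: {"counter": 41}})
image = system.snapshot()

# Userspace keeps running after the snapshot; none of this touches the image
# or the base disk.
system.processes[1]["counter"] = 42
system.write_file("/etc/motd", "changed after snapshot")
```

In this model the image still holds the state from the moment of the
snapshot, and the base disk is untouched, which is the property that lets
userspace "do whatever it wants" while saving the image.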

Cheers,
Kyle Moffett

Jeremy Fitzhardinge

Apr 27, 2007, 9:10:13 PM

Matthew Garrett wrote:
> While that would certainly be nifty, I think we're arguably starting
> from the wrong point here. Why are we booting a kernel, trying to poke
> the hardware back into some sort of mock-quiescent state, freeing memory
> and then (finally) overwriting the entire contents of RAM rather than
> just doing all of this from the bootloader?

Sure, you could make suspend generate a complete bootable kernel image
containing all RAM. Doesn't sound too hard to me. You know, from over
here on the sidelines.

J

Rafael J. Wysocki

Apr 27, 2007, 9:10:12 PM

On Saturday, 28 April 2007 03:00, Matthew Garrett wrote:
> On Fri, Apr 27, 2007 at 05:18:16PM -0700, Jeremy Fitzhardinge wrote:
>
> > Then you could use kexec for resume...
>
> While that would certainly be nifty, I think we're arguably starting
> from the wrong point here. Why are we booting a kernel, trying to poke
> the hardware back into some sort of mock-quiescent state, freeing memory
> and then (finally) overwriting the entire contents of RAM rather than
> just doing all of this from the bootloader? Given the time spent in
> kernel setup and unpacking initramfs nowadays, I'm willing to bet it'd
> still be faster even if you're stuck using int 13 on x86.

Yes, that would be faster.

> http://apcmag.com/5873/page14 suggests that Intel is looking into this,
> but I haven't heard anything more yet. To the best of my knowledge, this
> is also how Windows manages things.

I think you're right.

Greetings,
Rafael

Rafael J. Wysocki

Apr 27, 2007, 9:20:06 PM

Why do you want to do 2) after 1) and not vice versa?

> 3) All processes (except kernel threads) are now frozen.
> 4) Kernel should save internal state corresponding to current
> userspace state. The kernel also swaps out excess pages to free up
> enough RAM and prepares the snapshot file-descriptor with copies of
> kernel memory and the original (pre-COW) mapped userspace pages.
> 5) Kernel substitutes filesystems for either a device-mapper
> snapshot with snapblockdev as backing storage or union with tmpfs and
> remounts the underlying filesystems as read-only.
> 6) Kernel unfreezes all userspace processes and returns the snapshot
> FD to userspace (where it can be read from).

Okay, but how do we do the error recovery if, for example, the image cannot
be saved?

> Then userspace can do whatever it wants. Any changes to filesystems
> mounted at the time of snapshot will be discarded at shutdown.
> Freshly mounted filesystems won't have the union or COW thing done,
> and so you can write your snapshot to a compressed encrypted file on
> a USB key if you want to, you just have to unmount it before the
> snapshot() syscall and remount it right afterwards.

This seems to be a good idea.

Greetings,
Rafael

Linus Torvalds

Apr 27, 2007, 9:20:10 PM

On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
>
> > It's doubly bad, because that idiocy has also infected s2ram. Again,
> > another thing that really makes no sense at all - and we do it not just
> > for snapshotting, but for s2ram too. Can you tell me *why*?
>
> Why we freeze tasks at all or why we freeze kernel threads?

In many ways, "at all".

I _do_ realize the IO request queue issues, and that we cannot actually do
s2ram with some devices in the middle of a DMA. So we want to be able to
avoid *that*, there's no question about that. And I suspect that stopping
user threads and then waiting for a sync is practically one of the easier
ways to do so.

So in practice, the "at all" may become a "why freeze kernel threads?" and
freezing user threads I don't find really objectionable.

But as Paul pointed out, Linux on the old powerpc Mac hardware was
actually rather famous for having working (and reliable) suspend long
before it worked even remotely reliably on PC's. And they didn't do even
that.

(They didn't have ACPI, and they had a much more limited set of devices,
but the whole process freezer is really about neither of those issues. The
wild and wacky PC hardware has its problems, but that's _one_ thing we
can't blame PC hardware for ;)

> > git grep create_freezeable_workthread
>
> s/workthread/workqueue/

Yes.

> > and ponder the end results of that grep. If you don't see something wrong,
> > you're blind.
>
> This was a mistake, quite unrelated to the point you're making.

Did you actually _do_ the "grep" (with the fixed argument)?

I had two totally independent points. #1 was that you yourself have been
fixing bugs in this area. #2 was the result of that grep. It's absolutely
_empty_ except for the define to add that interface.

NOBODY USES IT!

Now, grep for the same interface that creates _non_freezeable workqueues.

Put another way:

[torvalds@woody linux]$ git grep create_workqueue | wc -l
35

[torvalds@woody linux]$ git grep create_freezeable_workqueue | wc -l
1

and that _one_ hit you get for the "freezeable" case is not actually a
user, it's the definition!
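
The same "definition versus actual users" split can be sketched in a few
lines of Python. This is only an illustration: classify_hits is a made-up
helper, the sample lines are invented rather than taken from the tree, and
the whole-word match is stricter than a plain git grep, so the counts it
gives are not comparable to the numbers above.

```python
import re


def classify_hits(lines, identifier):
    """Split lines mentioning `identifier` into (definitions, users)."""
    # Whole-word match, so __create_workqueue does not count as create_workqueue.
    pattern = re.compile(r"\b" + re.escape(identifier) + r"\b")
    definitions, users = [], []
    for line in lines:
        if not pattern.search(line):
            continue
        if line.lstrip().startswith("#define"):
            definitions.append(line)
        else:
            users.append(line)
    return definitions, users


# Invented sample input; the macro bodies are placeholders.
sample = [
    "#define create_workqueue(name) /* placeholder body */",
    "#define create_freezeable_workqueue(name) /* placeholder body */",
    'wq = create_workqueue("example_a");',
    'wq = create_workqueue("example_b");',
]

plain_defs, plain_users = classify_hits(sample, "create_workqueue")
frz_defs, frz_users = classify_hits(sample, "create_freezeable_workqueue")
```

With this sample, the freezeable variant has one definition and no users,
which is the shape of the result being pointed at.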

Ie my point is, nobody wants freezeable kernel threads. Absolutely nobody.

Yet we have all this support for freezing them (or rather, we freeze them
by default, and then we have all this support for _not_ doing that wrong
default thing!)

So yes, I think it would be interesting to just stop freezing kernel
threads. Totally.

Linus

Bojan Smojver

Apr 27, 2007, 9:30:13 PM

Nigel Cunningham <nigel <at> nigel.suspend2.net> writes:

> 4) uswsusp and swsusp get dropped and Suspend2 goes into mainline.

After reading most of this thread, it seems that Linus is of the view that all
three of these suck in one way or another. Suspend2 has the most features and is
the fastest of the lot. It can behave like swsusp from the user's point of view
(i.e. echo disk > /sys/power/state), so the migration should be seamless for
most distros. It isn't complicated to set up. It's been proven in the field. It
looks pretty.

So, while we're waiting for the next STD technology, why not have the best and
develop from there?

--
Bojan

David Lang

Apr 27, 2007, 9:30:12 PM

On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:

it doesn't really need to matter. if you care, just arrange to not schedule user
processes while you are doing both steps.

>> 3) All processes (except kernel threads) are now frozen.
>> 4) Kernel should save internal state corresponding to current
>> userspace state. The kernel also swaps out excess pages to free up
>> enough RAM and prepares the snapshot file-descriptor with copies of
>> kernel memory and the original (pre-COW) mapped userspace pages.
>> 5) Kernel substitutes filesystems for either a device-mapper
>> snapshot with snapblockdev as backing storage or union with tmpfs and
>> remounts the underlying filesystems as read-only.
>> 6) Kernel unfreezes all userspace processes and returns the snapshot
>> FD to userspace (where it can be read from).
>
> Okay, but how do we do the error recovery if, for example, the image cannot
> be saved?

give the user an error message telling him this, wait for confirmation, and then
jump directly to the restore step. revert everything to the snapshot image(s),
restart it.

David Lang

Kyle Moffett

Apr 27, 2007, 9:30:14 PM

On Apr 27, 2007, at 21:15:28, Rafael J. Wysocki wrote:
> On Saturday, 28 April 2007 03:03, Kyle Moffett wrote:
>> On Apr 27, 2007, at 18:07:46, Nigel Cunningham wrote:
>>> But in doing so you make the contents of the disk inconsistent
>>> with the state you've just snapshotted, leading to filesystem
>>> corruption. Even if you modify filesystems to do checkpointing
>>> (which is what we're really talking about), you still also have
>>> the problem that your snapshot has to be stored somewhere before
>>> you write it to disk, so you also have to either [snip]
>>

>> When sys_snapshot is run, the kernel does:
>>
>> 1) Sequentially freeze mounted filesystems using blockdev
>> freezing. If it's an fs that doesn't support freezing then either
>> fail or force-remount-ro that fs and downgrade all its
>> file descriptors to RO. Doesn't need extra locking since processes
>> which try to do IO either succeed before the freeze call returns
>> for that blockdev or sleep on the unfreeze of that blockdev.
>> Filesystems are synchronized and made clean.
>> 2) Iterate over the userspace process list, freezing each process
>> and remapping all of its pages copy-on-write. Any device-specific
>> pages need to have state saved by that device.
>
> Why do you want to do 2) after 1) and not vice versa?

(1) can be done without extra locking. Device-mapper already has
code to freeze filesystems and that makes a natural process-stopping
point. Any threads doing IO will very quickly put themselves to
sleep at (1) and save us some effort during step 2.

>> 6) Kernel unfreezes all userspace processes and returns the
>> snapshot FD to userspace (where it can be read from).
>
> Okay, but how do we do the error recovery if, for example, the
> image cannot be saved?

If the image can't be saved then there are 2 options:
(1) Call sys_restore() with the image
(2) Pass your snapshot file-descriptor to sys_unsnapshot()

In the former case, the system will be restored to the state it was
at a few seconds earlier, right as it took the snapshot. In the
latter case the modified-in-memory snapshot pages will be synced back
to the disk filesystems, the copy-on-write data-structures torn down
(think of merging an LVM snapshot back into its base device), and the
memory allocated for the snapshot will be freed. Either way the
system is properly in sync with disk again, the only difference is
whether you want to preserve the userspace state from during the
attempted snapshot (IE: any error status). You could also save the
error state in case (1) by just auto-posting a bug-report on
http://bugs.$VENDOR.com/ of course :-D.
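
The difference between the two recovery options can be sketched with a toy
Python model. This is an assumption-laden illustration, not an
implementation: toy_restore and toy_unsnapshot are made-up stand-ins for the
proposed sys_restore() and sys_unsnapshot(), and plain dicts stand in for
process state, the base filesystem, and the copy-on-write overlay.

```python
def toy_restore(image, base_disk, overlay):
    """Option (1): return to the snapshot; post-snapshot writes are dropped."""
    # The overlay is deliberately ignored: everything written after the
    # snapshot is discarded along with the later userspace state.
    return dict(image), dict(base_disk)


def toy_unsnapshot(current_state, base_disk, overlay):
    """Option (2): keep the current state and merge the overlay into the base."""
    merged = dict(base_disk)
    merged.update(overlay)  # like merging an LVM snapshot back into its base device
    return dict(current_state), merged


image = {"app": "state at snapshot"}
current = {"app": "state after failed save"}
base = {"/var/log/app.log": "old"}
overlay = {"/var/log/app.log": "old + error report"}

restored_state, restored_disk = toy_restore(image, base, overlay)
kept_state, merged_disk = toy_unsnapshot(current, base, overlay)
```

Either way the state and the disk agree with each other afterwards; only the
second option keeps the error report written during the failed attempt.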

Cheers,
Kyle Moffett

David Lang

Apr 27, 2007, 9:40:07 PM

On Fri, 27 Apr 2007, Linus Torvalds wrote:

> On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
>>
>>> It's doubly bad, because that idiocy has also infected s2ram. Again,
>>> another thing that really makes no sense at all - and we do it not just
>>> for snapshotting, but for s2ram too. Can you tell me *why*?
>>
>> Why we freeze tasks at all or why we freeze kernel threads?
>
> In many ways, "at all".
>
> I _do_ realize the IO request queue issues, and that we cannot actually do
> s2ram with some devices in the middle of a DMA. So we want to be able to
> avoid *that*, there's no question about that. And I suspect that stopping
> user threads and then waiting for a sync is practically one of the easier
> ways to do so.
>
> So in practice, the "at all" may become a "why freeze kernel threads?" and
> freezing user threads I don't find really objectionable.

there was a thread last week (or so) about splitting up the process list, one
list for normal user processes, one for kernel threads, and one for dead
processes waiting to be reaped.

it almost sounds like what you want to do is to act as if the normal user
threads weren't there for a short time (while you make the snapshot) and then
recover them to continue and save the snapshot.

David Lang

Rafael J. Wysocki

Apr 27, 2007, 9:50:06 PM

On Saturday, 28 April 2007 03:12, Linus Torvalds wrote:
>
> On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> >
> > > It's doubly bad, because that idiocy has also infected s2ram. Again,
> > > another thing that really makes no sense at all - and we do it not just
> > > for snapshotting, but for s2ram too. Can you tell me *why*?
> >
> > Why we freeze tasks at all or why we freeze kernel threads?
>
> In many ways, "at all".
>
> I _do_ realize the IO request queue issues, and that we cannot actually do
> s2ram with some devices in the middle of a DMA. So we want to be able to
> avoid *that*, there's no question about that. And I suspect that stopping
> user threads and then waiting for a sync is practically one of the easier
> ways to do so.
>
> So in practice, the "at all" may become a "why freeze kernel threads?" and
> freezing user threads I don't find really objectionable.
>
> But as Paul pointed out, Linux on the old powerpc Mac hardware was
> actually rather famous for having working (and reliable) suspend long
> before it worked even remotely reliably on PC's. And they didn't do even
> that.
>
> (They didn't have ACPI, and they had a much more limited set of devices,
> but the whole process freezer is really about neither of those issues. The
> wild and wacky PC hardware has its problems, but that's _one_ thing we
> can't blame PC hardware for ;)

We freeze user space processes for the reasons that you have quoted above.

Why we freeze kernel threads in there too is a good question, but not for me to
answer. I don't know. Pavel should know, I think.

> > > git grep create_freezeable_workthread
> >
> > s/workthread/workqueue/
>
> Yes.
>
> > > and ponder the end results of that grep. If you don't see something wrong,
> > > you're blind.
> >
> > This was a mistake, quite unrelated to the point you're making.
>
> Did you actually _do_ the "grep" (with the fixed argument)?
>
> I had two totally independent points. #1 was that you yourself have been
> fixing bugs in this area. #2 was the result of that grep. It's absolutely
> _empty_ except for the define to add that interface.
>
> NOBODY USES IT!

The reason is pretty simple.

We wanted to drop that interface altogether, because it was broken (my fault),
but Oleg suggested that we keep it so that we could fix and use it in the
future (for purposes other than the hibernation, though).

> Now, grep for the same interface that creates _non_freezeable workqueues.
>
> Put another way:
>
> [torvalds@woody linux]$ git grep create_workqueue | wc -l
> 35
>
> [torvalds@woody linux]$ git grep create_freezeable_workqueue | wc -l
> 1
>
> and that _one_ hit you get for the "freezeable" case is not actually a
> user, it's the definition!
>
> Ie my point is, nobody wants freezeable kernel threads. Absolutely nobody.

That's freezable workqueues only. :-)

> Yet we have all this support for freezing them (or rather, we freeze them
> by default, and then we have all this support for _not_ doing that wrong
> default thing!)
>
> So yes, I think it would be interesting to just stop freezing kernel
> threads. Totally.

Okay, I'll do that.

Greetings,
Rafael

Daniel Hazelton

Apr 27, 2007, 11:00:10 PM

On Friday 27 April 2007 21:44:48 Rafael J. Wysocki wrote:
> On Saturday, 28 April 2007 03:12, Linus Torvalds wrote:
> > On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> > > > It's doubly bad, because that idiocy has also infected s2ram. Again,
> > > > another thing that really makes no sense at all - and we do it not
> > > > just for snapshotting, but for s2ram too. Can you tell me *why*?
> > >
> > > Why we freeze tasks at all or why we freeze kernel threads?
> >
> > In many ways, "at all".
> >
> > I _do_ realize the IO request queue issues, and that we cannot actually
> > do s2ram with some devices in the middle of a DMA. So we want to be able
> > to avoid *that*, there's no question about that. And I suspect that
> > stopping user threads and then waiting for a sync is practically one of
> > the easier ways to do so.
> >

<snip>

Apparently I *CANNOT* wrap my head around this - if only because my laptop,
running a vendor 2.6.17 kernel, does s2ram perfectly; at least, it does when
using the "Upstart" init system rather than the classical SysV init system. I
have tried it with the classical init and the suspend isn't triggered by the
buttons that used to do it. I didn't try 'echo ram > /sys/power/state', but I
have a feeling that would have worked as well. I have problems with s2disk,
but that's because I keep my swap partition small - I try to keep it at or
around 256M when I have more than half a gig of RAM in a system. Perhaps one
of these days I'll grab a multi-gig flash disk, set it up as a swap partition
and try it again. (Every time I've tried s2disk I wind up running out of disk
space - and this is with nothing but X running. Any kind of progress meter
for when the system is doing s2disk would be nice - every time I've tried it
all I see for the nearly 2 minutes before the s2disk attempt ends is a black
screen. I say 2 minutes because that's how long it takes for it to learn that
there isn't enough space on the swap partition to save the image.)

DRH

Pavel Machek

Apr 28, 2007, 3:10:23 AM

Hi!

Just turn up console loglevel to see the messages.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Pavel Machek

Apr 28, 2007, 5:00:06 AM

Hi!

> > In many ways, "at all".
> >
> > I _do_ realize the IO request queue issues, and that we cannot actually do
> > s2ram with some devices in the middle of a DMA. So we want to be able to
> > avoid *that*, there's no question about that. And I suspect that stopping
> > user threads and then waiting for a sync is practically one of the easier
> > ways to do so.
> >
> > So in practice, the "at all" may become a "why freeze kernel threads?" and
> > freezing user threads I don't find really objectionable.
> >
> > But as Paul pointed out, Linux on the old powerpc Mac hardware was
> > actually rather famous for having working (and reliable) suspend long
> > before it worked even remotely reliably on PC's. And they didn't do even
> > that.
> >
> > (They didn't have ACPI, and they had a much more limited set of devices,
> > but the whole process freezer is really about neither of those issues. The
> > wild and wacky PC hardware has its problems, but that's _one_ thing we
> > can't blame PC hardware for ;)
>
> We freeze user space processes for the reasons that you have quoted above.
>
> Why we freeze kernel threads in there too is a good question, but not for me to
> answer. I don't know. Pavel should know, I think.

We do not want kernel threads running:

a) they may hold some locks and deadlock suspend

b) they may do some writes to disk, leading to corruption

We could solve a) by carefully auditing suspend lock usage to make
sure deadlocks are impossible even with kernel threads running.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Oliver Neukum

Apr 28, 2007, 5:10:05 AM

Am Samstag, 28. April 2007 01:50 schrieb David Lang:
> 3. make mounted filesystems read-only (possibly with snapshot/checkpoint)
> 4. unpause
> 5. save image (with full userspace available, including network)
> 6. shutdown system (throw away all userspace memory, no need to do graceful
> shutdown or nice kill signals, revert filesystem to snapshot/checkpoint if
> needed)

And then you'll have people wonder why the server which sent out all
those files has no log entries. You'd have to selectively unfreeze user
space, which is a cure worse than the disease.

Simply throwing away user space work is a bug. And no, you cannot say that
it'll be redone anyway, as you are throwing away accepted input, too.

Regards
Oliver

Pekka J Enberg

Apr 28, 2007, 5:20:10 AM

On Sat, 28 Apr 2007, Oliver Neukum wrote:
> And then you'll have people wonder why the server which sent out all
> those files has no log entries. You'd have to selectively unfreeze user
> space, which is a cure worse than the disease.
>
> Simply throwing away user space work is a bug. And no, you cannot say that
> it'll be redone anyway, as you are throwing away accepted input, too.

It's not a bug, it's a feature =). While I totally agree with you that for
the common case, you probably do want to avoid work in the userspace after
taking the snapshot, it is something that should be solved separately.
There is absolutely nothing wrong with taking a snapshot, doing some work,
and then resuming to the snapshot and thus "losing" some of the work (this
is useful for debugging, for example).

Pekka

Pekka Enberg

Apr 28, 2007, 5:30:10 AM

Hi Oliver,

Am Freitag, 27. April 2007 12:12 schrieb Pekka J Enberg:
> > The problem with writing in the kernel is obvious: we need to add new code
> > to the kernel for compression, encryption, and userspace interaction
> > (graphical progress bar) that are important for user experience.

On 4/27/07, Oliver Neukum <oli...@neukum.org> wrote:
> The kernel can already do compression and encryption.

Yes, if we all could agree on _which_ compression and encryption
algorithm(s) we want to use. It goes beyond that too, where do you
want to save the image? In the swap device or a regular file? And
don't forget about debuggability either. It's faster to do a
snapshot/resume without shutdown/restart in the middle or just do a
snapshot, and examine its contents.

Rafael J. Wysocki

Apr 28, 2007, 5:30:17 AM

On Saturday, 28 April 2007 10:50, Pavel Machek wrote:
> Hi!
>
> > > In many ways, "at all".
> > >
> > > I _do_ realize the IO request queue issues, and that we cannot actually do
> > > s2ram with some devices in the middle of a DMA. So we want to be able to
> > > avoid *that*, there's no question about that. And I suspect that stopping
> > > user threads and then waiting for a sync is practically one of the easier
> > > ways to do so.
> > >
> > > So in practice, the "at all" may become a "why freeze kernel threads?" and
> > > freezing user threads I don't find really objectionable.
> > >
> > > But as Paul pointed out, Linux on the old powerpc Mac hardware was
> > > actually rather famous for having working (and reliable) suspend long
> > > before it worked even remotely reliably on PC's. And they didn't do even
> > > that.
> > >
> > > (They didn't have ACPI, and they had a much more limited set of devices,
> > > but the whole process freezer is really about neither of those issues. The
> > > wild and wacky PC hardware has its problems, but that's _one_ thing we
> > > can't blame PC hardware for ;)
> >
> > We freeze user space processes for the reasons that you have quoted above.
> >
> > Why we freeze kernel threads in there too is a good question, but not for me to
> > answer. I don't know. Pavel should know, I think.
>
> We do not want kernel threads running:
>
> a) they may hold some locks and deadlock suspend

Yeah, the same issue as with the hibernation and I do think it's _real_.

> b) they may do some writes to disk, leading to corruption

Hmm, is that an issue in the suspend (aka s2ram) case?

> We could solve a) by carefully auditing suspend lock usage to make
> sure deadlocks are impossible even with kernel threads running.

Yes, we can, but it hasn't been done yet.

Greetings,
Rafael

Rafael J. Wysocki

Apr 28, 2007, 6:40:08 AM

On Friday, 27 April 2007 12:12, Pekka J Enberg wrote:
> Am Freitag, 27. April 2007 08:18 schrieb Pekka J Enberg:
> > > No. The snapshot is just that. A snapshot in time. From kernel point of
> > > view, it doesn't matter one bit when you did it or if the state has
> > > changed before you resume. It's up to userspace to make sure the user
> > > doesn't do real work while the snapshot is being written to disk and
> > > machine is shut down.
>
> On Fri, 27 Apr 2007, Oliver Neukum wrote:
> > And where is the benefit in that? How is such user space freezing logic
> > simpler than having the kernel do the write?
> >
> > What can you do in user space if all filesystems are r/o that is worth the
> > hassle?
>
> I am talking about snapshot_system() here. It's not given that the
> filesystems need to be read-only (you can snapshot them too). The benefit
> here is that you can do whatever you want with the snapshot (encrypt,
> compress, send over the network) and have a clean well-defined interface
> in the kernel. In addition, aborting the snapshot is simpler, simply
> munmap() the snapshot.

Well, swsusp currently does almost the same, except that you can read the
image from the kernel as a stream of bytes, using read() and, during the
restore phase, upload the same image using write(). The advantage of this
is that the interface is symmetrical from the user space's point of view.
[You're cancelling the hibernation by closing /dev/snapshot, which also is
quite natural.]

If you look at the interface in user.c, there are only two ioctls really needed
for that in there, SNAPSHOT_ATOMIC_SNAPSHOT and
SNAPSHOT_ATOMIC_RESTORE. Two more are handy for freezing
tasks, SNAPSHOT_FREEZE and SNAPSHOT_UNFREEZE. The others were added
later, to make the user space part simpler or capable of doing some fancy
stuff, which I am ready to admit was a mistake.
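
To make the symmetry concrete, here is a toy Python model of that flow. It
is a sketch under loud assumptions: ToySnapshotDevice is invented for this
example and is not the real /dev/snapshot interface; its atomic_snapshot()
and atomic_restore() methods only play the roles that
SNAPSHOT_ATOMIC_SNAPSHOT and SNAPSHOT_ATOMIC_RESTORE play above, and no
ioctl is actually issued.

```python
import io


class ToySnapshotDevice:
    """Toy stand-in for a snapshot device (hypothetical, in-memory only)."""

    def __init__(self, memory):
        self.memory = bytes(memory)  # pretend system memory
        self._image = None

    def atomic_snapshot(self):
        # Plays the role of SNAPSHOT_ATOMIC_SNAPSHOT: capture the image.
        self._image = io.BytesIO(self.memory)

    def read(self, size):
        # User space pulls the image out as a plain stream of bytes.
        return self._image.read(size)

    def write(self, chunk):
        # During restore, user space pushes the same stream back in.
        if self._image is None:
            self._image = io.BytesIO()
        self._image.write(chunk)

    def atomic_restore(self):
        # Plays the role of SNAPSHOT_ATOMIC_RESTORE: adopt the uploaded image.
        self.memory = self._image.getvalue()

    def close(self):
        # Closing the device drops the image, which cancels the hibernation.
        self._image = None


def save_image(dev, chunk_size=4):
    """Snapshot, then drain the image with read() until it returns nothing."""
    dev.atomic_snapshot()
    chunks = []
    while True:
        chunk = dev.read(chunk_size)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


def load_image(dev, image, chunk_size=4):
    """Upload the image with write(), then restore from it."""
    for offset in range(0, len(image), chunk_size):
        dev.write(image[offset:offset + chunk_size])
    dev.atomic_restore()


source = ToySnapshotDevice(b"system state")
saved = save_image(source)
source.close()

target = ToySnapshotDevice(b"freshly booted")
load_image(target, saved)
```

Between save_image() and load_image() the bytes are ordinary user-space
data, so compressing, encrypting, or shipping them elsewhere needs no
further help from the device.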

> The problem with writing in the kernel is obvious: we need to add new code
> to the kernel for compression, encryption, and userspace interaction
> (graphical progress bar) that are important for user experience.

Yes, and that's why we wanted to introduce the userland part. The problem
with this approach, as it's turned out, is that the userland part must be a
very specialized piece of software, really careful of what it's doing, mainly
because of the inability to checkpoint filesystems. If we could checkpoint
filesystems and were able to unfreeze the user space after creating the
snapshot without the risk of corrupting filesystems in the restore phase,
the userland part could be much simpler (even as simple as Linus suggested).

Bodo Eggert

Apr 28, 2007, 7:10:13 AM

Pavel Machek <pa...@ucw.cz> wrote:

>> I also don't like the idea of storing this in the swap partition for a
>> couple of reasons.
>>
>> 1. on many modern linux systems the swap partition is not large enough.
>>
>> for example, on my boxes with 16G of RAM I only allocate 2G of swap
>> space
>
> WTF? So allocate larger swap partition. You just told me disks are big
> enough.

1) Repartitioning is sometimes not an option.
2) What happens if the swap space gets used?

I want to be sure I can suspend my {server,laptop} in case of power running
out. Using swap is only an option for desktops.

>> 2. it's too easy for other things to stomp on your swap partition.
>>
>> for example: booting from a live CD that finds and uses swap
>> partitions
>
> That's a feature. If you are booting from live CD, you _want_ to erase
> any hibernation image.

NACK. You want to keep all partitions related to the hibernated system
read-only. That's completely different from destroying all your unsaved
data and possibly long-running tasks.
--
Top 100 things you don't want the sysadmin to say:
51. YEEEHA!!! What a CRASH!!!

Friß, Spammer: C...@rzlmn.7eggert.dyndns.org D9G...@Zk.7eggert.dyndns.org

Oliver Neukum

Apr 28, 2007, 9:40:09 AM

Am Samstag, 28. April 2007 11:22 schrieb Pekka Enberg:
> Hi Oliver,
>
> Am Freitag, 27. April 2007 12:12 schrieb Pekka J Enberg:
> > > The problem with writing in the kernel is obvious: we need to add new code
> > > to the kernel for compression, encryption, and userspace interaction
> > > (graphical progress bar) that are important for user experience.
>
> On 4/27/07, Oliver Neukum <oli...@neukum.org> wrote:
> > The kernel can already do compression and encryption.
>
> Yes, if we all could agree on _which_ compression and encryption

Any of those available in the kernel. Where's the problem?

> algorithm(s) we want to use. It goes beyond that too, where do you
> want to save the image? In the swap device or a regular file? And

A swap device is doubtlessly easier. But isn't the problem of using
a swap file already fixed? The writeout seems the easiest part of
hibernation.

> don't forget about debuggability either. It's faster to do a
> snapshot/resume without shutdown/restart in the middle or just do a
> snapshot, and examine its contents.

Then use a "fake reboot" option and save the image to a ramdisk.
It isn't that hard. You must be able to survive that, as io errors during
write out are possible.
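
A minimal sketch of that recovery path, written as a user-space Python model
rather than kernel code. The names `hibernate`, `power_off`, `thaw` and
`FailingDevice` are hypothetical and not taken from any existing patch; the
only point is that a failed writeout has to lead back to a running system,
and that a "fake reboot" can reuse the same path.

```python
import io


def hibernate(image: bytes, device, power_off, thaw, fake_reboot=False):
    """Write the image; power off on success, otherwise thaw and carry on."""
    try:
        device.write(image)
        device.flush()
    except OSError:
        # I/O error during writeout: give up and resume the running system.
        thaw()
        return "aborted"
    if fake_reboot:
        # Debug mode: the image is on the (ram)disk, but we keep running.
        thaw()
        return "snapshot-only"
    power_off()
    return "powered-off"


class FailingDevice(io.BytesIO):
    """Stand-in for a disk that returns an I/O error on every write."""

    def write(self, data):
        raise OSError("simulated I/O error")
```

With an `io.BytesIO()` standing in for the ramdisk, the fake-reboot case
leaves an image that can be examined while the system keeps running.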

Regards
Oliver

Linus Torvalds

Apr 28, 2007, 2:01:50 PM

On Sat, 28 Apr 2007, Pavel Machek wrote:
>
> We do not want kernel threads running:
>
> a) they may hold some locks and deadlock suspend
>
> b) they may do some writes to disk, leading to corruption

You're really just making both of those up.

If a kernel thread holds a lock and deadlocks suspend, that would deadlock
anything else _too_. Suspend isn't *that* special. Everything it does are
things other people do too.

And no, kernel threads do not write to disk on their own. Name one. They
help *others* write to disk, but those disk writes need to happen.

The freezer has *caused* those deadlocks (eg by stopping threads that were
needed for the suspend writeouts to succeed!), not solved them.

So stop making these totally bogus arguments up.

Linus

Rafael J. Wysocki

Apr 28, 2007, 2:02:28 PM

On Saturday, 28 April 2007 18:28, Linus Torvalds wrote:
>
> On Sat, 28 Apr 2007, Pavel Machek wrote:
> >
> > We do not want kernel threads running:
> >
> > a) they may hold some locks and deadlock suspend
> >
> > b) they may do some writes to disk, leading to corruption
>
> You're really just making both of those up.
>
> If a kernel thread holds a lock and deadlocks suspend, that would deadlock
> anything else _too_. Suspend isn't *that* special. Everything it does are
> things other people do too.
>
> And no, kernel threads do not write to disk on their own. Name one.

xfssyncd, or at least it seems so at a quick look.

> They help *others* write to disk, but those disk writes need to happen.
>
> The freezer has *caused* those deadlocks (eg by stopping threads that were
> needed for the suspend writeouts to succeed!), not solved them.

I can't remember anything like this, but I believe you have a specific test
case in mind.

> So stop making these totally bogus arguments up.

Well, they may be bogus, but there's something else.

I have reviewed some kernel threads used by device drivers that currently are
frozen to see if it would be safe not to freeze them, and I'm worried.

What, for example, if such a thread schedules a timeout and waits for
something to happen (eg. the airo driver does something like this), but instead
the hibernation/suspend happens and the device is frozen/suspended under it?

Shouldn't the thread be notified by the driver's freeze/suspend callback?

Moreover, what if after the restore the device is not present (for example, it
may be a pcmcia card that the user has removed) and the thread is scheduled
before the device's unfreeze callback has a chance to run? Shouldn't the
thread check that the device is present? In that case it would have to be
notified by someone that the check is necessary, but who can do that?
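
One possible answer, sketched as a user-space Python model rather than driver
code: the driver's own suspend/resume callbacks do the notifying. The `Device`
class and its method names are hypothetical, and the threading primitives only
stand in for whatever wait queue a real driver would use.

```python
import threading


class Device:
    def __init__(self):
        self.present = True
        self.suspended = False
        self.wake = threading.Condition()
        self.log = []

    def suspend(self):
        # Driver's suspend callback: mark the state and wake any waiter.
        with self.wake:
            self.suspended = True
            self.wake.notify_all()

    def resume(self, still_present):
        # Driver's resume callback: record whether the hardware is still there.
        with self.wake:
            self.present = still_present
            self.suspended = False
            self.wake.notify_all()

    def worker(self):
        with self.wake:
            # Park while suspended instead of timing out against dead hardware.
            self.wake.wait_for(lambda: not self.suspended)
            if not self.present:
                self.log.append("device gone, exiting")
                return
            self.log.append("touching hardware")
```

Because `resume()` updates `present` and `suspended` under the same lock the
worker waits on, the worker cannot run between "device unfrozen" and "presence
checked" in this model.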

Greetings,
Rafael

David Lang

Apr 28, 2007, 3:10:07 PM

On Sat, 28 Apr 2007, Pavel Machek wrote:

>>
>> We freeze user space processes for the reasons that you have quoted above.
>>
>> Why we freeze kernel threads in there too is a good question, but not for me to
>> answer. I don't know. Pavel should know, I think.
>
> We do not want kernel threads running:
>
> a) they may hold some locks and deadlock suspend
>
> b) they may do some writes to disk, leading to corruption
>
> We could solve a) by carefully auditing suspend lock usage to make
> sure deadlocks are impossible even with kernel threads running.

remember that we are doing suspend-to-disk, after we do the snapshot we will be
doing a shutdown. that should simplify the locking issues

David Lang

Apr 28, 2007, 3:10:10 PM

On Sat, 28 Apr 2007, Oliver Neukum wrote:

> On Saturday, 28 April 2007 01:50, David Lang wrote:
>> 3. make mounted filesystems read-only (possibly with snapshot/checkpoint)
>> 4. unpause
>> 5. save image (with full userspace available, including network)
>> 6. shutdown system (throw away all userspace memory, no need to do graceful
>> shutdown or nice kill signals, revert filesystem to snapshot/checkpoint if
>> needed)
>
> And then you'll have people wonder why the server which sent out all
> those files has no log entries. You'd have to selectively unfreeze user
> space, which is a cure worse than the disease.
>
> Simply throwing away user space work is a bug. And no, you cannot say that
> it'll be redone anyway, as you are throwing away accepted input, too.

when you are doing a suspend-to-disk I disagree with you. whoever is doing the
suspend knows what is going on, and they can decide what needs to be done.

the only case where you have 'unexpected' work being thrown away is if you are
suspending a network server, and the process of suspending it is going to cut
all the network connections anyway so it's not a seamless process. In this case
it's fair to let the sysadmin choose between losing some logs or doing some
other step to prevent this from happening (which could be to shutdown the
network service, or load an iptables rule to block the service)

however, most of the uses of suspend-to-disk are going to be single-user
machines and in that case telling the user that anything that they do after
issuing the suspend is going to be lost is a perfectly sane thing to do.

and for that matter, if the snapshot is cheap enough, some people may choose to
cron the snapshot portion of a suspend-to-disk every few minutes as a safety net
for something going wrong. In this case they really do want all of userspace to
keep working after the snapshot.
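
a toy model of what "lost after the snapshot" means, as a Python sketch. the
`System` class and its methods are made up for illustration and say nothing
about how the kernel would actually take the snapshot.

```python
import copy


class System:
    def __init__(self):
        self.state = {"log": []}
        self.image = None

    def snapshot(self):
        # Atomic copy of the state; the live system keeps running afterwards.
        self.image = copy.deepcopy(self.state)

    def work(self, entry):
        # Userspace activity, allowed both before and after the snapshot.
        self.state["log"].append(entry)

    def resume_from_image(self):
        # Everything done after snapshot() is absent from the restored state.
        self.state = copy.deepcopy(self.image)
```

in the cron case resume_from_image() is only called after something went
wrong, so the work done since the last snapshot is what you lose.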

David Lang

Bill Davidsen

Apr 28, 2007, 3:10:12 PM

Nigel Cunningham wrote:

> Please, go apply that logic elsewhere, then cut out (or at least stop
> adding) support for users with less common needs in other areas. I fully
> acknowledge that most users have only one place to store their image and
> it's a swap device. But that doesn't mean one size fits all.
>
I think to some extent that's part of the problem. Consider for a moment
that a /dev/hibernate would be required, and that it must be (a) a disk,
or (b) a partition, or (c) other devices in the future, like an nbd, USB
flash or DVD.

Don't have a device like that, then can't hibernate. Stop trying to be
smart and use swap for two different things. Stop trying to have an
interface between user space and kernel which does things not required
to preserve the system. A progress indicator is not needed, power off is
my progress indicator, and should be the sole valid end of a hibernate.

> A full image implies that you need to figure out what's not going to
> change while you're writing it and save that separately. At the moment,
> I'm treating most of the LRU contents as that list. If we're going to
> start trying to let every man and his dog run while we're trying to
> snapshot the system, that's not going to work anymore - or the logic
> will get a lot more complicated.
>
> Sorry. I never thought I'd say this, but I think you're being naive
> about how simple the process of snapshotting a system is.

Hibernate is useful to avoid complex boot, it's useful as the UPS gets
tired, and putting features in the process beyond saving the snap
(possibly compressed and/or encrypted) just adds complexity. Put it all
in the kernel and use /sys/power/state as the user interface. Stop
oversolving the problem.
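
For reference, a minimal Python sketch of how small that interface is from
user space. Reading /sys/power/state lists the supported states and writing
one of them (for example "disk") requests it; the helper names
`supported_states` and `request_state` are hypothetical, and the path is a
parameter only so the functions can be tried against an ordinary file.

```python
from pathlib import Path

STATE_FILE = Path("/sys/power/state")


def supported_states(path=STATE_FILE):
    """Return the sleep states listed in the file, one word per state."""
    return Path(path).read_text().split()


def request_state(state, path=STATE_FILE):
    """Ask for `state`; on a real system this needs root and suspends."""
    if state not in supported_states(path):
        raise ValueError(f"{state!r} is not a supported state")
    Path(path).write_text(state)
```

Everything beyond that write (where the image goes, whether it is compressed
or encrypted) would then be the kernel's business.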

No, that doesn't avoid other hard issues, but for the most part suspend2
has addressed them.

--
Bill Davidsen <davi...@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot

Rafael J. Wysocki

Apr 28, 2007, 3:20:05 PM

On Saturday, 28 April 2007 20:32, David Lang wrote:
> On Sat, 28 Apr 2007, Pavel Machek wrote:
>
> >>
> >> We freeze user space processes for the reasons that you have quoted above.
> >>
> >> Why we freeze kernel threads in there too is a good question, but not for me to
> >> answer. I don't know. Pavel should know, I think.
> >
> > We do not want kernel threads running:
> >
> > a) they may hold some locks and deadlock suspend
> >
> > b) they may do some writes to disk, leading to corruption
> >
> > We could solve a) by carefully auditing suspend lock usage to make
> > sure deadlocks are impossible even with kernel threads running.
>
> remember that we are doing suspend-to-disk, after we do the snapshot we will be
> doing a shutdown. that should simplify the locking issues

That's assuming that we won't need to cancel the hibernation.

Greetings,
Rafael