Back to the future.


Nigel Cunningham

Apr 26, 2007, 2:10:10 AM
Hi again.

So - trying to get back to the original discussion - what (if anything)
do you see as the way ahead?

The options I can think of are (starting with things I can do):

1) I stop developing Suspend2, thereby pushing however many current
Suspend2 users to move to [u]swsusp and seek to get that up to speed.

2) I quit my day job, see if Redhat will take me full time and give me
the time to start trying to merge Suspend2 bit by bit. Alternatively,
days suddenly become 8 hours longer and I discover the boundless energy
and alertness needed to do this too :). Ok. Not going to happen.

3) Someone else steps up to the plate and tries to merge Suspend2 one
bit at a time.

4) uswsusp and swsusp get dropped and Suspend2 goes into mainline.

5) Everything gets dropped and we start from scratch.

6) The status quo - or some small variant of it - stays. Oh... I said
"way ahead". I guess that rules this one out, even though I'll be very
surprised if it's not the one that wins out.

7) Suspend2 gets merged and people get to choose which they like better.
Nearly forgot this as a conceivable possibility. Yeah, I know you said
you don't want it. I'm just trying to think of what might possibly
happen.

N.


Pekka Enberg

Apr 26, 2007, 3:40:06 AM
On 4/26/07, Nigel Cunningham <ni...@nigel.suspend2.net> wrote:
> 3) Someone else steps up to the plate and tries to merge Suspend2 one
> bit at a time.

So which bits do we want to merge? For example, Suspend2
kernel/power/ui.c, kernel/power/compression.c, and
kernel/power/encryption.c seem pointless now that we have uswsusp.
Furthermore, being the shameless Linus cheerleader that I am, I got
the impression that we should fix the snapshot/shutdown logic in the
kernel which Suspend2 doesn't really address?

Pekka
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Nigel Cunningham

Apr 26, 2007, 3:50:08 AM
Hi.

On Thu, 2007-04-26 at 10:28 +0300, Pekka Enberg wrote:
> On 4/26/07, Nigel Cunningham <ni...@nigel.suspend2.net> wrote:
> > 3) Someone else steps up to the plate and tries to merge Suspend2 one
> > bit at a time.
>
> So which bits do we want to merge? For example, Suspend2
> kernel/power/ui.c, kernel/power/compression.c, and
> kernel/power/encryption.c seem pointless now that we have uswsusp.
> Furthermore, being the shameless Linus cheerleader that I am, I got
> the impression that we should fix the snapshot/shutdown logic in the
> kernel which Suspend2 doesn't really address?

I agree that the driver logic could be addressed too, but to answer your
question...

* Doing things in the right order? (Prepare the image, then do the
atomic copy, then save).
* Multithreaded I/O (might as well use multiple cores to compress the
image, now that we're hotplugging later).
* Support for > 1 swap device.
* Support for ordinary files.
* Full image option.
* Modular design?

Regards,

Nigel


Pekka Enberg

Apr 26, 2007, 4:30:09 AM
Hi Nigel,

On 4/26/07, Nigel Cunningham <ni...@nigel.suspend2.net> wrote:
> * Doing things in the right order? (Prepare the image, then do the
> atomic copy, then save).

As I am a total newbie to the power management code, I am unable to
spot the conceptual difference in uswsusp suspend.c:suspend_system()
and suspend2 kernel/power/suspend.c:suspend_main(). How are they
different?

On 4/26/07, Nigel Cunningham <ni...@nigel.suspend2.net> wrote:
> * Multithreaded I/O (might as well use multiple cores to compress the
> image, now that we're hotplugging later).

I assume this doesn't affect the kernel at all with uswsusp?

On 4/26/07, Nigel Cunningham <ni...@nigel.suspend2.net> wrote:

> * Modular design?

This is too broad. Please be more specific about the problems the current
suspend and snapshot/shutdown code in the kernel has.

Now to add to your list, as far as I can tell, suspend2 provides
better feedback to the user via the netlink mechanism (although the
kernel shouldn't be sending messages such as userui_redraw but instead
let the userspace know of the actual events, for example, that tasks
have now been frozen). However, I am unsure if this is still relevant
as most of the work (snapshot writing) is being done in userspace
where we explicitly know when processes have been frozen, when the
snapshot is finished, and when it's written to disk.

Jan Engelhardt

Apr 26, 2007, 4:50:12 AM

On Apr 26 2007 16:04, Nigel Cunningham wrote:
>
>Hi again.
>
>So - trying to get back to the original discussion - what (if anything)
>do you see as the way ahead?
>
>The options I can think of are (starting with things I can do):
>
>1) [...]
>2) [...]
>3) [...]
>4) [...]
>5) [...]
>6) [...]
>7) [...]

Perhaps do it the EVMS way? Do as much in userspace as possible, and
try having a simple kernel API at the same time.
Perhaps (3) would be it, but ask Redhat _first_ before quitting anything :)


Jan
--

Nigel Cunningham

Apr 26, 2007, 9:52:44 AM
Hi.

On Thu, 2007-04-26 at 11:17 +0300, Pekka Enberg wrote:
> Hi Nigel,
>
> On 4/26/07, Nigel Cunningham <ni...@nigel.suspend2.net> wrote:
> > * Doing things in the right order? (Prepare the image, then do the
> > atomic copy, then save).
>
> As I am a total newbie to the power management code, I am unable to
> spot the conceptual difference in uswsusp suspend.c:suspend_system()
> and suspend2 kernel/power/suspend.c:suspend_main(). How are they
> different?

Will discuss in irc since you've appeared there...

> On 4/26/07, Nigel Cunningham <ni...@nigel.suspend2.net> wrote:
> > * Multithreaded I/O (might as well use multiple cores to compress the
> > image, now that we're hotplugging later).
>
> I assume this doesn't affect the kernel at all with uswsusp?

Well uswsusp would benefit from using multiple threads - if it can - to
do the work. I saw quite an improvement from implementing it.

> On 4/26/07, Nigel Cunningham <ni...@nigel.suspend2.net> wrote:
> > * Modular design?
>
> This is too broad. Please be more specific of the problems the current
> suspend and snapshot/shutdown code in the kernel has.

Did you see the 'Reasons to merge' email I sent? It has more detail on
this.

> Now to add to your list, as far as I can tell, suspend2 provides
> better feedback to the user via the netlink mechanism (although the
> kernel shouldn't be sending messages such as userui_redraw but instead
> let the userspace know of the actual events, for example, that tasks
> have now been frozen). However, I am unsure if this is still relevant
> as most of the work (snapshot writing) is being done in userspace
> where we explicitly know when processes have been frozen, when the
> snapshot is finished, and when it's written to disk.

From uswsusp's point of view, yeah. But I'm still coming from the 'doing
this in kernelspace makes far more sense' perspective.

Regards,

Nigel


Nigel Cunningham

Apr 26, 2007, 9:57:53 AM
Hi.

On Thu, 2007-04-26 at 10:38 +0200, Jan Engelhardt wrote:
> On Apr 26 2007 16:04, Nigel Cunningham wrote:
> >
> >Hi again.
> >
> >So - trying to get back to the original discussion - what (if anything)
> >do you see as the way ahead?
> >
> >The options I can think of are (starting with things I can do):
> >
> >1) [...]
> >2) [...]
> >3) [...]
> >4) [...]
> >5) [...]
> >6) [...]
> >7) [...]
>
> Perhaps do it the EVMS way? Do as much in userspace as possible, and
> try having a simple kernel API at the same time.
> Perhaps (3) would be it, but ask Redhat _first_ before quitting anything :)

:) Well, the EVMS way is swsusp. Personally, I agree with Linus that
putting suspend to disk code in userspace is just a broken idea.

Regards,

Nigel


Linus Torvalds

Apr 26, 2007, 1:00:10 PM

On Thu, 26 Apr 2007, Nigel Cunningham wrote:
>
> * Doing things in the right order? (Prepare the image, then do the
> atomic copy, then save).

I'd actually like to discuss this a bit..

I'm obviously not a huge fan of the whole user/kernel level split and
interfaces, but I actually do think that there is *one* split that makes
sense:

- generate the (whole) snapshot image entirely inside the kernel

- do nothing else (ie no IO at all), and just export it as a single image
to user space (literally just mapping the pages into user space).
*one* interface. None of the "pretty UI update" crap. Just a single
system call:

void *snapshot_system(u32 *size);

which will map in the snapshot, return the mapped address and the size
(and if you want to support snapshots > 4GB, be my guest, but I suspect
you're actually *better* off just admitting that if you cannot shrink
the snapshot to less than 32 bits, it's not worth doing)

User space gets a fully running system, with that one process having that
one image mapped into its address space. It can then compress/write/do
whatever to that snapshot.

You need one other system call, of course, which is

int resume_snapshot(void *snapshot, u32 size);

and for testing, you should be able to basically do

    u32 size;
    void *buffer = snapshot_system(&size);
    if (buffer != MAP_FAILED)
        resume_snapshot(buffer, size);

and it should obviously work.

And btw, the device model changes are a big part of this. Because I don't
think it's even remotely debuggable with the full suspend/resume of the
devices being part of generating the image! That freeze/snapshot/unfreeze
sequence is likely a lot more debuggable, if only because freeze/unfreeze
is actually a no-op for most devices, and snapshotting is trivial too.

Once you have that snapshot image in user space you can do anything you
want. And again: you'd have a fully working system: not any degradation
*at*all*. If you're in X, then X will continue running etc even after the
snapshotting, although obviously the snapshotting will have tried to page
a lot of stuff out in order to make the snapshot smaller, so you'll likely
be crawling.

> * Multithreaded I/O (might as well use multiple cores to compress the
> image, now that we're hotplugging later).
> * Support for > 1 swap device.
> * Support for ordinary files.
> * Full image option.
> * Modular design?

I'd really suggest _just_ the "full image". Nothing else is probably ever
worth supporting. Your "snapshot to disk" wouldn't be _quite_ as simple as
"echo disk > /sys/power/state", but it should not necessarily be much
worse than

snapshot_kernel | gzip -9 > /dev/snapshot

either (and resuming from the snapshot would just be the reverse)!

And if you want to send the snapshot over a TCP connection to another
host, be my guest. With pretty images while it's transferring. Whatever.

Linus
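
A minimal sketch of the user-space half of the proposal above, assuming the snapshot is already mapped into the calling process: all that remains is pushing `size` bytes at some file descriptor. `write_all` is a hypothetical helper name, not part of any proposed interface; it relies only on POSIX `write(2)` and copes with short writes and `EINTR`.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Write all len bytes of buf to fd.
 * Returns 0 on success, -1 on error (errno is left as set by write()).
 */
int write_all(int fd, const void *buf, size_t len)
{
    const unsigned char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */
            return -1;
        }
        p += n;                 /* short write: advance and go again */
        len -= (size_t)n;
    }
    return 0;
}
```

With the two proposed calls, user space would presumably call `write_all(fd, buffer, size)` on the pointer returned by `snapshot_system()`; a compressor or a network socket would slot in at the same place.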

Xavier Bestel

Apr 26, 2007, 1:10:07 PM
On Thu, 2007-04-26 at 09:56 -0700, Linus Torvalds wrote:
> Once you have that snapshot image in user space you can do anything you
> want. And again: you'd have a fully working system: not any degradation
> *at*all*. If you're in X, then X will continue running etc even after the
> snapshotting

Won't there be problems if e.g. X tries to write something to its
logfile after snapshot ?

Xav

Linus Torvalds

Apr 26, 2007, 1:20:14 PM

On Thu, 26 Apr 2007, Linus Torvalds wrote:
>
> Once you have that snapshot image in user space you can do anything you
> want.

Side note: the exception, of course, is page out more. The swap device has
to be read-only.

We actually have support for that mode (it's how "swapoff" works: it marks
swap devices as not accepting _new_ entries, even though old entries are
still valid). So you can have a fully running system, with 99% of memory
swapped out, and still guarantee that you won't swap out anything *more*
(which would destroy the swap image, which you don't want, since it's
where a lot of the memory may end up being, in order to make the snapshot
itself as small as possible)!

Anybody who cares can look at the code that messes with the
SWP_WRITEOK flag. You'd basically swap out enough to make the snapshot
image fit comfortably in memory, and then you'd clear SWP_WRITEOK on all
swap devices and return to user space. Or something very close to that.
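
The idea in the paragraph above can be modelled in a few lines of ordinary C. This is a toy user-space model, not kernel code: the struct, the `toy_` function names and the flag values are assumptions made for illustration, and only the flag's name and its meaning (no *new* swap entries, old ones stay valid) come from the text.

```c
#include <stddef.h>

/* Toy flag bits; the names echo the kernel's, the values are illustrative. */
enum {
    TOY_SWP_USED    = 1 << 0,   /* device holds valid swap entries */
    TOY_SWP_WRITEOK = 1 << 1    /* new entries may be allocated    */
};

struct toy_swap_dev {
    unsigned int flags;
};

/* May a new page be swapped out to this device? */
int toy_can_swap_out(const struct toy_swap_dev *dev)
{
    return (dev->flags & TOY_SWP_USED) && (dev->flags & TOY_SWP_WRITEOK);
}

/* May an already swapped-out page still be read back in? */
int toy_can_swap_in(const struct toy_swap_dev *dev)
{
    return (dev->flags & TOY_SWP_USED) != 0;
}

/* After the snapshot: stop accepting new entries on every device. */
void toy_freeze_swap(struct toy_swap_dev *devs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        devs[i].flags &= ~(unsigned int)TOY_SWP_WRITEOK;
}
```

After `toy_freeze_swap()`, swap-in still works on every device while swap-out is refused, which is the "read-only swap" state the message describes.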

But the point here is that we should actually really be able to have a
fully working system, even _after_ we created the snapshot. I don't even
think you should need any "initrd only" kind of situation.

If somebody can do that, with just those two system calls, I'll remove
every other suspend-to-disk wannabe from the kernel in a heartbeat. I may
have missed something subtle, of course, but I really *think* it should be
doable.

Linus Torvalds

Apr 26, 2007, 1:40:11 PM

On Thu, 26 Apr 2007, Xavier Bestel wrote:
>
> Won't there be problems if e.g. X tries to write something to its
> logfile after snapshot ?

Sure. But that's a user-level issue.

You do have to allow writing after snapshotting, since at a minimum, you'd
want the snapshot itself to be written. So the kernel has to be fully
running, and support full user space. No "degraded mode" like now.

So when I said "fully running user mode", I really meant it from the
perspective of the kernel - not necessarily from the perspective of the
"user". You do want to limit _what_ user mode does, but you must not limit
it by making the kernel less capable.

Remounting mounted filesystems read-only sounds like a good idea, for
example. We can do that. We have the technology. But we shouldn't limit
user space from doing other things (for example, it might want to actually
*mount* a new filesystem for writing the snapshot image).

For example, right now we try to "fix" that with the whole process freezer
thing. And that code has *caused* more problems than it fixed, since it
tries to freeze all the kernel threads etc, and you simply don't have a
truly *working*system*.

I think it's fine to freeze processes if that is what you want to do (for
example, send them SIGSTOP), but we freeze them *too*much* right now, and
the suspend stuff has taken over policy way too much. We don't actually
leave the system in a runnable state. I can almost guarantee that you'd be
*better* off having the snapshot taking thing do a

kill(-1, SIGSTOP);

in user space than our current broken process freezer. At least that
wouldn't have screwed up those kernel threads as badly as swsusp did.
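
For readers unfamiliar with the mechanism, stopping a process from user space with SIGSTOP can be sketched as below. This is an illustration on a single forked child rather than on `-1` (which, per the Linux kill(2) man page, targets every process the caller may signal apart from init and the caller itself); `stop_one_child` is a hypothetical name.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child, stop it with SIGSTOP and confirm the stop via waitpid().
 * Returns 1 if the child was observed stopped, 0 if not, -1 if fork failed.
 */
int stop_one_child(void)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        for (;;)
            pause();            /* child: sleep until a signal arrives */
    }

    int status = 0;
    int stopped = 0;

    /* SIGSTOP cannot be caught or ignored; WUNTRACED reports the stop. */
    if (kill(pid, SIGSTOP) == 0 &&
        waitpid(pid, &status, WUNTRACED) == pid)
        stopped = WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP;

    /* Clean up: SIGKILL also terminates a stopped process. */
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return stopped;
}
```

Kernel threads are not affected by signals sent this way, which is the contrast with the in-kernel freezer being drawn above.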

And no, I'm not saying that my suggestion is the only way to do it. Go
wild. But the *current* situation is just broken. Three different things,
none of which people can agree on. I'd *much* rather see a conceptually
simpler approach that then required, but even more important is that right
now people aren't even discussing alternatives, they're just pushing one
of the three existing things, and that's simply not viable. Because I'm
not merging another one.

In fact, I personally feel that I shouldn't even have merged
userspace-swsusp, but if Andrew thinks it needs to be merged, my personal
feelings simply don't matter that much. I have to trust people. But yes,
as far as *I* am personally concerned, I think it was a mistake to merge
it.

Linus

Luca Tettamanti

Apr 26, 2007, 1:40:17 PM
Nigel Cunningham <ni...@nigel.suspend2.net> wrote:

> On Thu, 2007-04-26 at 11:17 +0300, Pekka Enberg wrote:
>> On 4/26/07, Nigel Cunningham <ni...@nigel.suspend2.net> wrote:
>> > * Multithreaded I/O (might as well use multiple cores to compress the
>> > image, now that we're hotplugging later).
>>
>> I assume this doesn't affect the kernel at all with uswsusp?
>
> Well uswsusp would benefit from using multiple threads - if it can - to
> do the work. I saw quite an improvement from implementing it.

It's doable[1], but I'm not sure that the added complexity is worth it.
I'm surprised that you see a big improvement. I'd expect that the image
write is bottlenecked by the disk performance. On my PC (Core2, locked
at 1.6GHz) lzf can compress 250-280MB/s; even with an older CPU that can
do 1/3 it's still more than the disk can handle.
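
The bottleneck argument above can be put as a one-line model: the image is consumed at the slower of the compressor's input rate and the disk's write rate divided by the compression ratio. The function name, the disk speed and the ratio are assumptions for illustration; only the 250-280MB/s lzf figure comes from the message.

```c
/*
 * Rate (MB/s of uncompressed image) at which a compress-then-write
 * pipeline can proceed.
 *   compress_mb_s: compressor input throughput
 *   disk_mb_s:     sustained disk write throughput
 *   ratio:         compressed size / original size, in (0, 1]
 */
double image_write_rate(double compress_mb_s, double disk_mb_s, double ratio)
{
    /* The disk only has to absorb ratio * input, so it keeps up with
     * an input rate of disk_mb_s / ratio. */
    double disk_limited = disk_mb_s / ratio;

    return compress_mb_s < disk_limited ? compress_mb_s : disk_limited;
}
```

With an assumed 50MB/s disk and a 0.5 ratio, `image_write_rate(250.0, 50.0, 0.5)` gives 100.0, so the disk is the limit. Which side limits a slower CPU depends on the assumed disk speed and ratio.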

Luca
[1] We may even use MPI to compress over a Beowulf cluster, it's
userspace ;)
--
"Always remember that you are unique, exactly like everyone else".

Chase Venters

Apr 26, 2007, 3:20:12 PM
On Thu, 26 Apr 2007, Linus Torvalds wrote:

>
> Once you have that snapshot image in user space you can do anything you
> want. And again: you'd have a fully working system: not any degradation
> *at*all*. If you're in X, then X will continue running etc even after the
> snapshotting, although obviously the snapshotting will have tried to page
> a lot of stuff out in order to make the snapshot smaller, so you'll likely
> be crawling.
>

In fact... If you're just paging out to make a smaller snapshot (ie, not
to free up memory), couldn't you just swap it out (if it's not backed by a
file) then mark it as "half-released"... ie, the snapshot writing code
ignores it knowing that it will be available on disk at resume, but then
when the snapshot is complete it's still available in physical RAM,
preventing user-space from crawling due to the necessity of paging it all
back in?

Thanks,
Chase

David Lang

Apr 26, 2007, 3:30:14 PM
On Thu, 26 Apr 2007, Chase Venters wrote:

> On Thu, 26 Apr 2007, Linus Torvalds wrote:
>
>>
>> Once you have that snapshot image in user space you can do anything you
>> want. And again: you'd have a fully working system: not any degradation
>> *at*all*. If you're in X, then X will continue running etc even after the
>> snapshotting, although obviously the snapshotting will have tried to page
>> a lot of stuff out in order to make the snapshot smaller, so you'll likely
>> be crawling.
>>
>
> In fact... If you're just paging out to make a smaller snapshot (ie, not
> to free up memory), couldn't you just swap it out (if it's not backed by a
> file) then mark it as "half-released"... ie, the snapshot writing code
> ignores it knowing that it will be available on disk at resume, but then
> when the snapshot is complete it's still available in physical RAM,
> preventing user-space from crawling due to the necessity of paging it all
> back in?

Your swap space may end up being reused before you restore with suspend-to-disk (std).

David Lang

Nigel Cunningham

Apr 26, 2007, 4:00:14 PM
Hi.

On Thu, 2007-04-26 at 09:56 -0700, Linus Torvalds wrote:
>
> On Thu, 26 Apr 2007, Nigel Cunningham wrote:
> >
> > * Doing things in the right order? (Prepare the image, then do the
> > atomic copy, then save).
>
> I'd actually like to discuss this a bit..
>
> I'm obviously not a huge fan of the whole user/kernel level split and
> interfaces, but I actually do think that there is *one* split that makes
> sense:
>
> - generate the (whole) snapshot image entirely inside the kernel
>
> - do nothing else (ie no IO at all), and just export it as a single image
> to user space (literally just mapping the pages into user space).
> *one* interface. None of the "pretty UI update" crap. Just a single
> system call:
>
> void *snapshot_system(u32 *size);
>
> which will map in the snapshot, return the mapped address and the size
> (and if you want to support snapshots > 4GB, be my guest, but I suspect
> you're actually *better* off just admitting that if you cannot shrink
> the snapshot to less than 32 bits, it's not worth doing)

That inherently limits the image to half of available ram (you need
somewhere to store the snapshot), so you won't get the full image you
express interest in below.

> User space gets a fully running system, with that one process having that
> one image mapped into its address space. It can then compress/write/do
> whatever to that snapshot.

You're describing uswsusp! (At least in so far as I understand it!).

You can't get a fully running system though, because if anything changes
something on disk that was snapshotted (super blocks etc) your snapshot
is invalid and you risk on-disk corruption.

> And btw, the device model changes are a big part of this. Because I don't
> think it's even remotely debuggable with the full suspend/resume of the
> devices being part of generating the image! That freeze/snapshot/unfreeze
> sequence is likely a lot more debuggable, if only because freeze/unfreeze
> is actually a no-op for most devices, and snapshotting is trivial too.
>
> Once you have that snapshot image in user space you can do anything you
> want. And again: you'd have a fully working system: not any degradation
> *at*all*. If you're in X, then X will continue running etc even after the
> snapshotting, although obviously the snapshotting will have tried to page
> a lot of stuff out in order to make the snapshot smaller, so you'll likely
> be crawling.

Nooooooo! See above about disk corruption.

> > * Multithreaded I/O (might as well use multiple cores to compress the
> > image, now that we're hotplugging later).
> > * Support for > 1 swap device.
> > * Support for ordinary files.
> > * Full image option.
> > * Modular design?
>
> I'd really suggest _just_ the "full image". Nothing else is probably ever
> worth supporting. Your "snapshot to disk" wouldn't be _quite_ as simple as
> "echo disk > /sys/power/state", but it should not necessarily be much
> worse than

Please, go apply that logic elsewhere, then cut out (or at least stop
adding) support for users with less common needs in other areas. I fully
acknowledge that most users have only one place to store their image and
it's a swap device. But that doesn't mean one size fits all.

A full image implies that you need to figure out what's not going to
change while you're writing it and save that separately. At the moment,
I'm treating most of the LRU contents as that list. If we're going to
start trying to let every man and his dog run while we're trying to
snapshot the system, that's not going to work anymore - or the logic
will get a lot more complicated.

Sorry. I never thought I'd say this, but I think you're being naive
about how simple the process of snapshotting a system is.

Regards,

Nigel


Nigel Cunningham

Apr 26, 2007, 4:10:13 PM
Hi.

On Thu, 2007-04-26 at 10:34 -0700, Linus Torvalds wrote:
>
> On Thu, 26 Apr 2007, Xavier Bestel wrote:
> >
> > Won't there be problems if e.g. X tries to write something to its
> > logfile after snapshot ?
>
> Sure. But that's a user-level issue.
>
> You do have to allow writing after snapshotting, since at a minimum, you'd
> want the snapshot itself to be written. So the kernel has to be fully
> running, and support full user space. No "degraded mode" like now.

It doesn't need a fully functional userspace (unless you want to write
to a fuse device, and even then that could be worked around - make it
like uswsusp or userui).... can I diverge for a second and say that from
this point of view, fuse is the lamest idea ever invented. Guaranteed to
break your ability to suspend^Wsnapshot.... Anyhow, if the kernel has
bmapped the pages it's going to write to beforehand, it knows where the
image needs to go. No need for userspace at all.

> So when I said "fully running user mode", I really meant it from the
> perspective of the kernel - not necessarily from the perspective of the
> "user". You do want to limit _what_ user mode does, but you must not limit
> it by making the kernel less capable.
>
> Remounting mounted filesystems read-only sounds like a good idea, for
> example. We can do that. We have the technology. But we shouldn't limit
> user space from doing other things (for example, it might want to actually
> *mount* a new filesystem for writing the snapshot image).

We tried that. It would need some work. IIRC remounting filesystems
read-only makes files become marked read-only. Perfectly sensible,
except that if you then remount the filesystem rw at resume time, all
those files are still marked ro and userspace crashes and burns. Not
unfixable, I'll agree, but there is more work to do there.

As to the example, mounting a new filesystem for writing the snapshot
image should probably be done before we do the snapshot. Then it won't
be in danger of triggering anything that might require one of the other
fses to be rw (eg syslog).

> For example, right now we try to "fix" that with the whole process freezer
> thing. And that code has *caused* more problems than it fixed, since it
> tries to freeze all the kernel threads etc, and you simply don't have a
> truly *working*system*.

Yes, it has been difficult. But so is bringing up a child.

> I think it's fine to freeze processes if that is what you want to do (for
> example, send them SIGSTOP), but we freeze them *too*much* right now, and
> the suspend stuff has taken over policy way too much. We don't actually
> leave the system in a runnable state. I can almost guarantee that you'd be
> *better* off having the snapshot taking thing do a
>
> kill(-1, SIGSTOP);
>
> in user space than our current broken process freezer. At least that
> wouldn't have screwed up those kernel threads as badly as swsusp did.

I don't think it's fair to blame swsusp there. Maybe cpu hotplugging...

> And no, I'm not saying that my suggestion is the only way to do it. Go
> wild. But the *current* situation is just broken. Three different things,
> none of which people can agree on. I'd *much* rather see a conceptually
> simpler approach that then required, but even more important is that right
> now people aren't even discussing alternatives, they're just pushing one
> of the three existing things, and that's simply not viable. Because I'm
> not merging another one.
>
> In fact, I personally feel that I shouldn't even have merged
> userspace-swsusp, but if Andrew thinks it needs to be merged, my personal
> feelings simply don't matter that much. I have to trust people. But yes,
> as far as *I* am personally concerned, I think it was a mistake to merge
> it.

Perhaps you should try to make an alternative yourself instead of
pushing us into making something we don't believe will work (my case) or
have already done but in a way you don't like (Rafael). Don't talk about
Pavel cutting code. He's just acking/nacking what Rafael sends him.

Nigel


Linus Torvalds

Apr 26, 2007, 4:50:16 PM

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
>
> Perhaps you should try to make an alternative yourself instead of
> pushing us into making something we don't believe will work (my case) or
> have already done but in a way you don't like (Rafael). Don't talk about
> Pavel cutting code. He's just acking/nacking what Rafael sends him.

I've done that in the past (USB, PCMCIA - screw the maintainers, redo
it basically from scratch). But the thing is, I'm totally uninterested
personally in the whole disk-snapshotting, so I'm not likely to do it
there.

But yes, I'm actually hoping that some new person will come in with a new
idea. The current people seem to be too set in "their" corners, and I
don't expect that to really change.

Quite honestly, I don't foresee any of the current three approaches really
doing something new and obviously better, unless somebody new steps in.

Nigel Cunningham

Apr 26, 2007, 5:00:09 PM
Hi.

On Thu, 2007-04-26 at 13:45 -0700, Linus Torvalds wrote:
>
> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> >
> > Perhaps you should try to make an alternative yourself instead of
> > pushing us into making something we don't believe will work (my case) or
> > have already done but in a way you don't like (Rafael). Don't talk about
> > Pavel cutting code. He's just acking/nacking what Rafael sends him.
>
> I've done that in the past (USB, PCMCIA - screw the maintainers, redo
> it basically from scratch). But the thing is, I'm totally uninterested
> personally in the whole disk-snapshotting, so I'm not likely to do it
> there.
>
> But yes, I'm actually hoping that some new person will come in with a new
> idea. The current people seem to be too set in "their" corners, and I
> don't expect that to really change.
>
> Quite honestly, I don't foresee any of the current three approaches really
> doing something new and obviously better, unless somebody new steps in.

That's because there is no other possibility. Sooner or later you have
to do a snapshot, and somehow you have to save it. You're not going to
get a new solution, just one that does those basic things in new and
better ways.

I'm perfectly willing to think through some alternate approach if you
suggest something or prod my thinking in a new direction, but I'm afraid
I just can't see right now how we can achieve what you're after.

Nigel


Theodore Tso

Apr 26, 2007, 5:40:40 PM
On Fri, Apr 27, 2007 at 06:08:01AM +1000, Nigel Cunningham wrote:
> We tried that. It would need some work. IIRC remounting filesystems
> read-only makes files become marked read-only. Perfectly sensible,
> except that if you then remount the filesystem rw at resume time, all
> those files are still marked ro and userspace crashes and burns. Not
> unfixable, I'll agree, but there is more work to do there.

There are other solutions, though. One is that we could export a
system call interface which freezes a filesystem and prevents any
further I/O. We mostly have something like that right now (via the
write_super_lockfs function in the superblock operations
structure), but we haven't exported it to userspace. And right now
not all filesystems support it, but in theory that could be fixed (or
you only support suspend/resume if all filesystems support lockfs).

We would also need a similar interface to freeze any block device I/O,
in case you have a database running and doing direct I/O to a block
device. (Or again, we could simply not support that case; how many
people will be running a database accessing a block device on
their laptop?)

So in order to do this right, we would have to double the number of
new interfaces needed from the two proposed by Linus --- which is why
I think the userspace suspend solution is fundamentally NOT the right
one. Rather the right one is the one which Linux ultimately used for
PCMCIA, which is to do it all in the kernel.

- Ted

Rafael J. Wysocki

Apr 26, 2007, 6:10:34 PM
On Thursday, 26 April 2007 22:08, Nigel Cunningham wrote:
[--snip--]

> > And no, I'm not saying that my suggestion is the only way to do it. Go
> > wild. But the *current* situation is just broken. Three different things,
> > none of which people can agree on. I'd *much* rather see a conceptually
> > simpler approach that then required, but even more important is that right
> > now people aren't even discussing alternatives, they're just pushing one
> > of the three existing things, and that's simply not viable. Because I'm
> > not merging another one.
> >
> > In fact, I personally feel that I shouldn't even have merged
> > userspace-swsusp, but if Andrew thinks it needs to be merged, my personal
> > feelings simply don't matter that much. I have to trust people. But yes,
> > as far as *I* am personally concerned, I think it was a mistake to merge
> > it.
>
> Perhaps you should try to make an alternative yourself instead of
> pushing us into making something we don't believe will work (my case) or
> have already done but in a way you don't like (Rafael). Don't talk about
> Pavel cutting code. He's just acking/nacking what Rafael sends him.

Well, I think that much of what Linus is saying indicates that he hasn't tried
to write any such thing himself. ;-)

Anyway, I'm tired of this whole thing. Really. I've just been trying to make
things _work_ more-or-less reliably in a way that Pavel liked and I really
didn't know that much about the kernel when I started. In fact, I started as a
user who needed certain functionality from the kernel and that was not there
at the time. I've made some mistakes because of that (like the definitions of
the ioctl numbers in suspend.h - this was just a rookie mistake, and I'm
ashamed of it, but _nobody_ caught it, although I believe many people were
looking at the patch).

Now that I know much more than before, I can say I agree with Linus on his
opinion about the separation of s2ram from the snapshot/restore functionality
(I'll call it 'hibernation' for simplicity from now on). It should be done,
because it would make things simpler and cleaner. Still, it will be difficult
to do without screwing users en masse and that's my main concern here.

I don't agree that we don't need the tasks freezer for suspending and
hibernation. We need it, because we need to be sure that the (other) tasks
will not get us in the way, and that also applies to kernel threads (and I
don't think the tasks freezer is 'screwing' them, BTW).

I agree that the userland interface for swsusp is not very nice and I'm going
to do my best to clean that up. I hope that someone will help me, but if not,
then that's fine. OTOH, it's difficult, if not impossible, to do a
userland-driven hibernation in a completely clean way. I've tried that and I'm
not exactly satisfied with the result, although it works and some distros use
it. I wouldn't have done it again, but then I'm going to support the existing
users, as I promised.

Now, I think that the hibernation should better be done completely in the
kernel, because that's just conceptually simpler, although some data exchange
with the user land may be acceptable for some optional fancy stuff. I'm also
tired of the endless "to merge or not to merge suspend2" discussions that just
lead to nowhere. For these reasons I declare that I'm ready to cooperate with
Nigel on integrating as much of suspend2 as reasonably possible into the
existing infrastructure, under the following conditions:
- we don't remove the existing user-visible interfaces
- we work on one piece of code at a time
- we avoid code duplication, as much as possible
- we avoid using open-coded things, if possible
- if we don't agree on something, we ask someone wiser (volunteers welcome ;-))

If that's acceptable, we can start tomorrow. In the process, we can try to
separate the hibernation code paths from the s2ram ones, but that will require
a lot of knowledge about things that neither me nor Nigel, AFAICT, are very
familiar with, like writing device drivers.

Greetings,
Rafael

Nigel Cunningham

unread,
Apr 26, 2007, 6:30:16 PM4/26/07
to
Hi Rafael.

I don't want to remove user visible interfaces either (I understand that
you mean the ioctls by that?). Perhaps we can find a way to make them
still usable with a more in-kernel solution (ie some things become
noops?).

> - we work on one piece of code at a time

Sure. We should spend some time discussing and planning beforehand so we
don't waste time and effort writing and rewriting.

> - we avoid code duplication, as much as possible

No problem there.

> - we avoid using open-coded things, if possible

Regarding open-coded things, I assume you're referring to the extents. I
would argue that they're not open-coded because list.h implements doubly
linked lists, and extents use a singly linked list. That said, I suppose
we could make the extents doubly linked and use list.h, even though that
would be a waste of 4/8 bytes per extent.
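
To put a number on the 4/8 bytes, here is a minimal sketch. The struct names are hypothetical (they are not the actual Suspend2 definitions), and struct list_node only stands in for the two-pointer struct list_head from list.h. It compares a singly linked extent with one that embeds a doubly linked node.

```c
#include <stddef.h>

/* Stand-in for the kernel's struct list_head: two pointers. */
struct list_node {
	struct list_node *next, *prev;
};

/* Hypothetical singly linked extent covering [start, end]. */
struct extent_single {
	unsigned long start, end;
	struct extent_single *next;
};

/* The same extent, linked through a doubly linked node instead. */
struct extent_double {
	unsigned long start, end;
	struct list_node link;
};

/* Extra bytes each extent pays for the second pointer. */
static size_t extent_overhead(void)
{
	return sizeof(struct extent_double) - sizeof(struct extent_single);
}
```

On common ABIs the overhead is exactly one pointer, i.e. 4 bytes on 32-bit and 8 bytes on 64-bit, which only matters if there are a great many extents.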

> - if we don't agree on something, we ask someone wiser (volunteers welcome ;-))

Absolutely!

> If that's acceptable, we can start tomorrow. In the process, we can try to
> separate the hibernation code paths from the s2ram ones, but that will require
> a lot of knowledge about things that neither me nor Nigel, AFAICT, are very
> familiar with, like writing device drivers.

Yes.

Thanks for this email. It's really encouraging, and I'm more than glad
to work with you. Unfortunately, as you've seen me keep saying already,
I have very limited time to work on this. Thankfully you seem to have
more, and Pekka has also stepped up to help, so maybe we can make good
forward progress despite my limitations.

Regards,

Nigel


Pavel Machek

unread,
Apr 26, 2007, 6:50:09 PM4/26/07
to
Hi!

> > * Doing things in the right order? (Prepare the image, then do the
> > atomic copy, then save).
>
> I'd actually like to discuss this a bit..
>
> I'm obviously not a huge fan of the whole user/kernel level split and
> interfaces, but I actually do think that there is *one* split that makes
> sense:
>
> - generate the (whole) snapshot image entirely inside the kernel
>
> - do nothing else (ie no IO at all), and just export it as a single image
> to user space (literally just mapping the pages into user space).
> *one* interface. None of the "pretty UI update" crap. Just a single
> system call:
>
> void *snapshot_system(u32 *size);
>
> which will map in the snapshot, return the mapped address and the size
> (and if you want to support snapshots > 4GB, be my guest, but I suspect
> you're actually *better* off just admitting that if you cannot shrink
> the snapshot to less than 32 bits, it's not worth doing)

This is basically how uswsusp is designed. (We do not use system call,
you just read from /dev/snapshot, and you have to make few ioctls to
stop the other tasks).

> and for testing, you should be able to basically do
>
> u32 size;
> void *buffer = snapshot_system(&size);
> if (buffer != MAP_FAILED)
> resume_snapshot(buffer, size);
>
> and it should obviously work.

Which is what I did long time ago, during uswsusp development.

> Once you have that snapshot image in user space you can do anything you
> want. And again: you'd have a fully working system: not any degradation
> *at*all*. If you're in X, then X will continue running etc even after the
> snapshotting, although obviously the snapshotting will have tried to page
> a lot of stuff out in order to make the snapshot smaller, so you'll likely
> be crawling.

Well... We decided not to do this in the fully working system. SIGSTOP
is just not strong enough, and we want the snapshot atomic.

Now, it would be _very_ nice to be able to snapshot system and
continue running, but I just don't see how to do it without extensive
filesystem support.
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Pavel Machek

unread,
Apr 26, 2007, 6:50:13 PM4/26/07
to
Hi!

> I'd really suggest _just_ the "full image". Nothing else is probably ever
> worth supporting. Your "snapshot to disk" wouldn't be _quite_ as simple as
> "echo disk > /sys/power/state", but it should not necessarily be much
> worse than
>
> snapshot_kernel | gzip -9 > /dev/snapshot

Yep, we "freeze too much", so we can't just use the shell and pipe
it. Too bad.

int write_image(char *resume_dev_name)
{
	static struct swap_map_handle handle;
	struct swsusp_info *header;
	unsigned long start;
	int fd;
	int error;

	fd = open(resume_dev_name, O_RDWR | O_SYNC);
	if (fd < 0) {
		error = -errno;	/* capture before printf() can clobber errno */
		printf("suspend: Could not open resume device\n");
		return error;
	}
	/*
	 * dev and buffer are not declared in this excerpt; presumably they
	 * are file-scope variables (the snapshot device fd and a page-sized
	 * buffer). The first page read is the image header.
	 */
	error = read(dev, buffer, PAGE_SIZE);
	if (error < PAGE_SIZE) {
		close(fd);
		return error < 0 ? error : -EFAULT;
	}
	header = (struct swsusp_info *)buffer;
	if (!enough_swap(header->pages)) {
		printf("suspend: Not enough free swap\n");
		close(fd);
		return -ENOSPC;
	}
	/* Write the header page first, remembering where the image starts. */
	error = init_swap_writer(&handle, fd);
	if (!error) {
		start = handle.cur_swap;
		error = swap_write_page(&handle, header);
	}
	/* Then the remaining pages of the image. */
	if (!error)
		error = save_image(&handle, header->pages - 1);
	/* Only mark the swap as holding an image once everything is written. */
	if (!error) {
		flush_swap_writer(&handle);
		printf( "S" );
		error = mark_swap(fd, start);
		printf( "|\n" );
	}
	fsync(fd);
	close(fd);
	return error;
}

This is basically the loop above, made complex by the fact that we do
not want to have separate partition for snapshot; we just want to
reuse free space in swap partition.

I think you've just invented uswsusp.

David Lang

unread,
Apr 26, 2007, 7:00:10 PM4/26/07
to
On Fri, 27 Apr 2007, Pavel Machek wrote:

> This is basically the loop above, made complex by the fact that we do
> not want to have separate partition for snapshot; we just want to
> reuse free space in swap partition.

with the size of drives today is it really that bad to require a separate
partition for this?

I also don't like the idea of storing this in the swap partition for a couple of
reasons.

1. on many modern linux systems the swap partition is not large enough.

for example, on my boxes with 16G of RAM I only allocate 2G of swap space

2. it's too easy for other things to stomp on your swap partition.

for example: booting from a live CD that finds and uses swap partitions

if you are needing space for your freeze, allocate it in an unambiguous way, not
by reusing an existing partition.

David Lang

Linus Torvalds

unread,
Apr 26, 2007, 7:20:06 PM4/26/07
to

On Fri, 27 Apr 2007, Rafael J. Wysocki wrote:
>
> Well, I think that much of what Linus is saying indicates that he hasn't tried
> to write any such thing himself. ;-)

That's definitely true. The only interaction I ever had with "hibernation"
(and yes, we should just call it that) is when I was working on s2ram and
cleaning up the PCI device suspend/resume in particular, and trying
(_mostly_ successfully - I think I broke it once or twice mainly due to
interactions with the console, but on the whole I think it mostly worked)
to not break hibernation in the process without actually running it.

> Now that I know much more than before, I can say I agree with Linus on his
> opinion about the separation of s2ram form the snapshot/restore functionality
> (I'll call it 'hibernation' for simplicity from now on).

So my strong opinion on it literally comes from the other end (ie _not_
knowing about hibernation, but trying to work with s2ram, and cursing the
mixups).

> It should be done, because it would make things simpler and cleaner.
> Still, it will be difficult to do without screwing users en masse and
> that's my main concern here.

I do agree. It will inevitably affect a lot of devices. That's always
painful.

> I don't agree that we don't need the tasks freezer for suspending and
> hibernation. We need it, because we need to be sure that the (other) tasks
> will not get us in the way, and that also applies to kernel threads (and I
> don't think the tasks freezer is 'screwing' them, BTW).

I actually feel much less strongly about that, because just separating out
s2ram and hibernate entirely from each other would already really get the
thing _I_ care about taken care of - being able to work on one of the
other without fear of breaking the other one.

And besides, I actually came into the whole discussion because I'm not a
huge fan of thinking that user-land is "better". If the thing can sanely
be done in kernel, I'm actually all for that. What drives me wild is
having three different things, and nobody driving.

It needs somebody who (a) cares (b) has good taste and (c) has enough time
and personal karma to burn that he can actually take the (obviously)
inevitable heat from just doing things right, and convincing people to
select *one* implementation.

That kind of person is really really hard to find. And if you're it,
you're in for some pain ;)

Linus

Pavel Machek

unread,
Apr 26, 2007, 7:20:09 PM4/26/07
to
Hi!

> >This is basically the loop above, made complex by the fact that we do
> >not want to have separate partition for snapshot; we just want to
> >reuse free space in swap partition.
>
> with the size of drives today is it really that bad to require a separate
> partition for this?

Yes. You want uswsusp to work in situations where swsusp worked.

> I also don't like the idea of storing this in the swap partition for a
> couple of reasons.
>
> 1. on many modern linux systems the swap partition is not large enough.
>
> for example, on my boxes with 16G of RAM I only allocate 2G of swap
> space

WTF? So allocate larger swap partition. You just told me disks are big
enough.

> 2. it's too easy for other things to stomp on your swap partition.
>
> for example: booting from a live CD that finds and uses swap
> partitions

That's a feature. If you are booting from live CD, you _want_ to erase
any hibernation image.

> if you are needing space for your freeze, allocate it in an unambiguous way,
> not by reusing an existing partition.

Of course you have that option. Writing image is done in userspace, so
you are free to write it to a raw partition (and the first versions indeed
did that).

Pavel Machek

unread,
Apr 26, 2007, 7:30:16 PM4/26/07
to
Hi!

> >That's a feature. If you are booting from live CD, you _want_ to erase
> >any hibernation image.
>

> why?
>
> it's been stated that doing a std and booting another OS (including
> windows) is a valid and common usage. saying that if you boot another OS
> you trash your suspended image doesn't sound reasonable.

If you hibernate your machine, boot from live cd, and change anything
on any filesystem, you are pretty likely to lose that filesystem.

Doing that with Windows is okay as Windows does not usually write to
ext3 partitions.

David Lang

unread,
Apr 26, 2007, 7:30:19 PM4/26/07
to
On Fri, 27 Apr 2007, Pavel Machek wrote:

> Hi!
>
>>> This is basically the loop above, made complex by the fact that we do
>>> not want to have separate partition for snapshot; we just want to
>>> reuse free space in swap partition.
>>
>> with the size of drives today is it really that bad to require a separate
>> partition for this?
>
> Yes. You want uswsusp to work in situations where swsusp worked.
>
>> I also don't like the idea of storing this in the swap partition for a
>> couple of reasons.
>>
>> 1. on many modern linux systems the swap partition is not large enough.
>>
>> for example, on my boxes with 16G of RAM I only allocate 2G of swap
>> space
>
> WTF? So allocate larger swap partition. You just told me disks are big
> enough.

swap partitions are limited to 2G (or at least they were a couple of months ago
when I last checked). I also don't want to run the risk of having a box try to
_use_ 16G worth of swap. I'd rather have the box hit OOM first.

>> 2. it's too easy for other things to stomp on your swap partition.
>>
>> for example: booting from a live CD that finds and uses swap
>> partitions
>
> That's a feature. If you are booting from live CD, you _want_ to erase
> any hibernation image.

why?

it's been stated that doing a std and booting another OS (including windows) is
a valid and common usage. saying that if you boot another OS you trash your
suspended image doesn't sound reasonable.

David Lang

David Lang

unread,
Apr 26, 2007, 7:40:08 PM4/26/07
to
On Fri, 27 Apr 2007, Pavel Machek wrote:

> Hi!
>
>>> That's a feature. If you are booting from live CD, you _want_ to erase
>>> any hibernation image.
>>
>> why?
>>
>> it's been stated that doing a std and booting another OS (including
>> windows) is a valid and common usage. saying that if you boot another OS
>> you trash your suspended image doesn't sound reasonable.
>
> If you hibernate your machine, boot from live cd, and change anything
> on any filesystem, you are pretty likely to lose that filesystem.

booting from a live CD doesn't mean that you are going to mount the filesystem,
let alone change it. but swap is not supposed to be this sensitive.

David Lang

> Doing that with Windows is okay as Windows does not usually write to
> ext3 partitions.
> Pavel
>

Olivier Galibert

unread,
Apr 26, 2007, 8:20:09 PM4/26/07
to
On Fri, Apr 27, 2007 at 06:50:56AM +1000, Nigel Cunningham wrote:
> I'm perfectly willing to think through some alternate approach if you
> suggest something or prod my thinking in a new direction, but I'm afraid
> I just can't see right now how we can achieve what you're after.

Ok, what about this approach I've been mulling about for a while:

Suspend-to-disk is pretty much an exercise in state saving. There are
multiple ways to do state saving, but they tend to end up in two
categories: implicit and explicit.

In implicit state saving, you try to save the state of the
system/application/whatever "under its feet", more or less, and then
fix up what is not saved/saveable correctly. A well-known example is
the undumping process Emacs goes (went?) through, where it tries to dump the
state of the memory as a new executable, with a lot of pleasure with
various executable formats and subtleties due to side effects in libc
code you don't control.

In explicit state saving each object saves what is needed from its
state to an independently defined format (instead of "whatever the
memory organization happens to be at that point"). When reloading the
state you have to parse it, and it usually requires
rebuilding/relocating all references/pointers/etc. XEmacs currently
has a "portable dumper" that pretty much does just that. We don't
have any redumping problems anymore, they're over.

Which one is the best depends heavily on the application. The amount
of code in the implicit case depends on the amount of fixups to do.
In the kernel case it happens to be a lot, pretty much everything that
touches hardware has to save to memory the device state and reload it
on resume. And bugs on hardware handling can be quite annoying to
debug. And if some driver does not do save/resume correctly, you
have no way outside of playing with modules to ensure the safety of
the suspend cycle.

The amount of code in the explicit case is an interesting variable in
the case of the kernel. You have to save what is needed, but how do
you define what is needed? It is, pretty much, what running processes
can observe from userspace. Now, what can a process observe:
- its application text and anonymous memory pages
- its file handles
- its mapped files
- its mapped whatever else
- its sys5 IPC stuff
- futex stuff and friends, namespaces, etc
- its intrinsic characteristics it can reach through syscalls
(i.e. the user-visible parts of current, like pid, uid...)
- its currently running system call, if any

So that's what we'd have to explicitly save. Anonymous memory, sys5
IPC, futex and current structures, that's easy stuff in practice. The
fun part are pretty much:
- references to files
- references to active networking links
- references to devices and associated visible state
- currently running system call, aka the kernel stack for the process

The last one is the one I'm the most afraid of. I hope that the
signal stuff and/or the asynchronous syscall stuff that was discussed
recently would allow to "unwind" blocking system calls back to the
syscall level and then store the parameters for resume-time restart.
The non-blocking calls you can just let finish.

The first one is really interesting. If you value your filesystems,
you'd rather have them clean after the suspend. And also you pretty
much know that filesystems can move around when you're not looking, be
it USB hotplug stuff (discovery order is random-ish isn't it?), module
loading order issues or multithreaded device discovery. So you're way
more happy *not* caching anything from the filesystem you can avoid.

But what is a file reference, really? With the dcache handy, it's
pretty much a path, since inodes don't always exist reliably. And if
you have the lists of paths used by the processes on a particular
filesystem, you can easily get an idea of where, if anywhere, the
filesystem is even if you don't have reliable serials. More
interestingly, you cannot, in any case, instantly corrupt your
filesystem by having a mismatch between the in-memory cache and the
reality.

The processes which referenced files you can't find anywhere will
end-up with EBADF or segfault depending on whether it was fd or mmap,
ala revoke(). They'll probably die horribly. I'd rather have
processes die than filesystems die, since in any case if the file
isn't here anymore in practice the process could only destroy things.
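
A rough sketch of that resume-time rule, in user-space C. Everything here is hypothetical naming for illustration: struct saved_file, restore_files() and the "present" list (which stands in for whatever VFS lookup would really be used) are assumptions, not existing kernel interfaces. Each saved reference is just a path; any path that cannot be found after resume is marked revoked, and later accesses through it fail with EBADF.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* One file reference as recorded at suspend time: only the path. */
struct saved_file {
	const char *path;
	int revoked;	/* set at resume if the path could not be found */
};

/* Stand-in for a real lookup: is this path visible after resume? */
static int path_exists(const char *const *present, size_t npresent,
		       const char *path)
{
	size_t i;

	for (i = 0; i < npresent; i++)
		if (strcmp(present[i], path) == 0)
			return 1;
	return 0;
}

/* Re-resolve every saved reference; returns how many were revoked. */
static size_t restore_files(struct saved_file *files, size_t nfiles,
			    const char *const *present, size_t npresent)
{
	size_t i, revoked = 0;

	for (i = 0; i < nfiles; i++) {
		files[i].revoked = !path_exists(present, npresent, files[i].path);
		revoked += (size_t)files[i].revoked;
	}
	return revoked;
}

/* What a later access through this reference would report. */
static int file_access(const struct saved_file *f)
{
	return f->revoked ? -EBADF : 0;
}
```

Nothing in this touches cached filesystem data: a mismatch costs the process its descriptor, not the filesystem its consistency.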

An interesting thing there: nothing in that touches either the
filesystem or the block devices. Everything is done at the VFS level.
The devices don't need to care. And the "this filesystem goes there"
can be done in userspace in an initramfs if people want to experiment
with kinky strategies. After all, why not allow a sysadmin to regroup
two filesystems into one through a suspend, the processes mostly don't
need to care (well, tar may, but heh). Deleted files would have to be
sillyrenamed or something. Implementation details ;-)

Active networking links, you can consider them dead for a start. The
networking guys can play with keepalives and stuff if they want to in
a second step. Network seldom survives suspend anyway, too many
timeouts involved, especially with dynamic IPs.

That leaves references to devices. null, ptys, random, log are not a
problem, they're virtual constructs. In a first approximation you can
revoke() the rest brutally. On a "standard" system that will kill X
(ouch), GPM and other input-interested devices, and everything with an
opened sound device. Then you can add explicit state saving support
to the devices you want, one by one. It may be possible to handle
sound collectively at the ALSA layer level, I don't really know.
Input shouldn't be too hard, not much state to save, X will be a pain
and will probably need special casing. X is a big special case
anyway, no matter what happens.

For the less directly used devices you can always add explicit support
when you feel like it. The interesting part is that either the device
supports the suspend and says so explicitly, or the process can't
access the device anymore using the previous fds/mmaps after resume.
No weird half-condition. If (very) resilient, the process can even
close, reopen, reconfigure and go on its merry way.

And if you design the saving format correctly (attribute name/value
pairs as text work beautifully for such a case), you can be resilient
to extreme things including kernel version change or rsync-ing / and
the state file and resuming in another box. And if a device gets
something it can't parse as the state to go back to for a given
fd/mmap for a process, it can always revoke() that one and go on.
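
As an illustration of the name/value idea, here is a minimal sketch for a made-up sound device state. struct snd_state, its two attributes and both function names are assumptions invented for this example, not an existing format. Unknown attribute names are skipped, so output from a different kernel version still loads; a known attribute that is malformed or missing makes the load fail, at which point the device would revoke() that fd.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-fd state of a sound device. */
struct snd_state {
	long rate;
	long channels;
};

/* Write the state as "name=value" lines; returns what snprintf() returns. */
static int snd_state_save(const struct snd_state *s, char *buf, size_t len)
{
	return snprintf(buf, len, "rate=%ld\nchannels=%ld\n",
			s->rate, s->channels);
}

/*
 * Parse "name=value" lines. Unknown names and lines without '=' are
 * ignored. Returns 0 on success, -1 if a known attribute is malformed
 * or missing.
 */
static int snd_state_load(const char *text, struct snd_state *s)
{
	int have_rate = 0, have_channels = 0;

	while (*text) {
		const char *eol = strchr(text, '\n');
		size_t linelen = eol ? (size_t)(eol - text) : strlen(text);
		const char *eq = memchr(text, '=', linelen);

		if (eq) {
			size_t namelen = (size_t)(eq - text);
			char *end;
			long val = strtol(eq + 1, &end, 10);
			/* valid only if digits were read up to the line end */
			int ok = end != eq + 1 && end == text + linelen;

			if (namelen == 4 && !memcmp(text, "rate", 4)) {
				if (!ok)
					return -1;
				s->rate = val;
				have_rate = 1;
			} else if (namelen == 8 && !memcmp(text, "channels", 8)) {
				if (!ok)
					return -1;
				s->channels = val;
				have_channels = 1;
			}
		}
		text += linelen + (eol ? 1 : 0);
	}
	return (have_rate && have_channels) ? 0 : -1;
}
```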

The main point of that kind of state-saving is to be
trustable-by-design. For each process, either its environment could
be restored correctly or the incorrect parts can not be accessed
anymore. And the stability of the system and its filesystems is
ensured pretty much whatever happens.


There are a billion details to take into account in a real
implementation, but I'm sure you can get the gist of the idea.

OG.

Olivier Galibert

unread,
Apr 26, 2007, 8:30:10 PM4/26/07
to
On Thu, Apr 26, 2007 at 03:49:51PM -0700, David Lang wrote:
> swap partitions are limited to 2G (or at least they were a couple of months
> ago when I last checked). I also don't want to run the risk of having a box
> try to _use_ 16G worth of swap. I'd rather have the box hit OOM first.

They aren't limited anymore, I have a number of machines with 20G swap
for experiments.

OG.

Pekka J Enberg

unread,
Apr 27, 2007, 1:00:13 AM4/27/07
to
On Thu, 2007-04-26 at 09:56 -0700, Linus Torvalds wrote:
> > which will map in the snapshot, return the mapped address and the size
> > (and if you want to support snapshots > 4GB, be my guest, but I suspect
> > you're actually *better* off just admitting that if you cannot shrink
> > the snapshot to less than 32 bits, it's not worth doing)

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> That inherently limits the image to half of available ram (you need
> somewhere to store the snapshot), so you won't get the full image you
> express interest in below.

It doesn't. We can make the userspace mapped pages copy-on-write. As long
as the userspace makes sure there's not much activity during
snapshot/shutdown, we will be fine. What we probably do need to copy is
kernel pages.

Pekka

Pekka Enberg

unread,
Apr 27, 2007, 1:50:27 AM4/27/07
to
On 4/27/07, Pavel Machek <pa...@ucw.cz> wrote:
> Now, it would be _very_ nice to be able to snapshot system and
> continue running, but I just don't see how to do it without extensive
> filesystem support.

So what kind of support do we need from the filesystem?

Pekka

Nigel Cunningham

unread,
Apr 27, 2007, 2:10:13 AM4/27/07
to
Hi.

On Fri, 2007-04-27 at 07:52 +0300, Pekka J Enberg wrote:
> On Thu, 2007-04-26 at 09:56 -0700, Linus Torvalds wrote:
> > > which will map in the snapshot, return the mapped address and the size
> > > (and if you want to support snapshots > 4GB, be my guest, but I suspect
> > > you're actually *better* off just admitting that if you cannot shrink
> > > the snapshot to less than 32 bits, it's not worth doing)
>
> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > That inherently limits the image to half of available ram (you need
> > somewhere to store the snapshot), so you won't get the full image you
> > express interest in below.
>
> It doesn't. We can make the userspace mapped pages copy-on-write. As long
> as the userspace makes sure there's not much activity during
> snapshot/shutdown, we will be fine. What we probably do need to copy is
> kernel pages.

COW is a possibility, but I understood (perhaps wrongly) that Linus was
thinking of a single syscall or such like to prepare the snapshot. If
you're going to start doing things like this, won't that mean you'd then
have to update/redo the snapshot or somehow nullify the effect of
anything the program does so that doing it again after the snapshot is
restored doesn't cause problems?

I was going to leave it at that and press send, but perhaps that
wouldn't be wise. I feel I should also ask what you're thinking of as a
means of making sure userspace doesn't do much activity.

Thanks for your labours!

Regards,

Nigel


Pekka J Enberg

unread,
Apr 27, 2007, 2:20:07 AM4/27/07
to
On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> COW is a possibility, but I understood (perhaps wrongly) that Linus was
> thinking of a single syscall or such like to prepare the snapshot. If
> you're going to start doing things like this, won't that mean you'd then
> have to update/redo the snapshot or somehow nullify the effect of
> anything the program does so that doing it again after the snapshot is
> restored doesn't cause problems?

No. The snapshot is just that. A snapshot in time. From kernel point of
view, it doesn't matter one bit when you did it or if the state has
changed before you resume. It's up to userspace to make sure the user
doesn't do real work while the snapshot is being written to disk and
machine is shut down.

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> I was going to leave it at that and press send, but perhaps that
> wouldn't be wise. I feel I should also ask what you're thinking of as a
> means of making sure userspace doesn't do much activity.

When the snapshot pages are COW, we will run out of memory if userspace
writes to those pages too much. If userspace is blocked, say like
displaying a "we are suspending" in X which blocks the user from using
other programs that could generate new writes and mounting filesystems
read-only, we don't need to worry about running out of memory.
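
To make the memory argument concrete, here is a toy user-space model of the idea, not kernel page-table code; struct cow_mem and the three functions are invented for illustration. Taking the snapshot copies nothing. Only the first write to a page after the snapshot consumes a spare page to preserve the original, so an eager copy needs one spare page per page (the half-of-RAM limit), while COW needs spare pages only in proportion to post-snapshot writes, and fails with ENOMEM once they run out.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define NPAGES  8
#define PAGE_SZ 16	/* tiny pages keep the sketch small */

struct cow_mem {
	unsigned char live[NPAGES][PAGE_SZ];	/* what userspace keeps running on */
	unsigned char saved[NPAGES][PAGE_SZ];	/* originals preserved for the image */
	int copied[NPAGES];			/* page was written since the snapshot */
	size_t spare;				/* free pages left for copies */
};

/* Take the snapshot: nothing is copied yet, every page is merely shared. */
static void cow_snapshot(struct cow_mem *m, size_t spare_pages)
{
	memset(m->copied, 0, sizeof(m->copied));
	m->spare = spare_pages;
}

/* A write after the snapshot: the first touch of a page costs one spare page. */
static int cow_write(struct cow_mem *m, size_t page, size_t off,
		     unsigned char val)
{
	if (page >= NPAGES || off >= PAGE_SZ)
		return -EINVAL;
	if (!m->copied[page]) {
		if (m->spare == 0)
			return -ENOMEM;	/* too much activity after the snapshot */
		memcpy(m->saved[page], m->live[page], PAGE_SZ);
		m->copied[page] = 1;
		m->spare--;
	}
	m->live[page][off] = val;
	return 0;
}

/* The page as it looked at snapshot time, i.e. what goes into the image. */
static const unsigned char *cow_snapshot_page(const struct cow_mem *m,
					      size_t page)
{
	return m->copied[page] ? m->saved[page] : m->live[page];
}
```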

Pekka J Enberg

unread,
Apr 27, 2007, 2:30:12 AM4/27/07
to
On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > COW is a possibility, but I understood (perhaps wrongly) that Linus was
> > thinking of a single syscall or such like to prepare the snapshot. If
> > you're going to start doing things like this, won't that mean you'd then
> > have to update/redo the snapshot or somehow nullify the effect of
> > anything the program does so that doing it again after the snapshot is
> > restored doesn't cause problems?

On Fri, 27 Apr 2007, Pekka J Enberg wrote:
> No. The snapshot is just that. A snapshot in time. From kernel point of
> view, it doesn't matter one bit when you did it or if the state has
> changed before you resume. It's up to userspace to make sure the user
> doesn't do real work while the snapshot is being written to disk and
> machine is shut down.

Btw, obviously we need to break the COW when resuming and not include the
snapshot mapping. However, that should be trivially doable by snapshotting
the page mappings before remapping them as COW.

Nigel Cunningham

unread,
Apr 27, 2007, 2:40:07 AM4/27/07
to
Hi.

On Fri, 2007-04-27 at 09:18 +0300, Pekka J Enberg wrote:
> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > COW is a possibility, but I understood (perhaps wrongly) that Linus was
> > thinking of a single syscall or such like to prepare the snapshot. If
> > you're going to start doing things like this, won't that mean you'd then
> > have to update/redo the snapshot or somehow nullify the effect of
> > anything the program does so that doing it again after the snapshot is
> > restored doesn't cause problems?
>
> No. The snapshot is just that. A snapshot in time. From kernel point of
> view, it doesn't matter one bit when you did it or if the state has
> changed before you resume. It's up to userspace to make sure the user
> doesn't do real work while the snapshot is being written to disk and
> machine is shut down.

Sorry Pekka, but that's just broken.

It implies firstly that we tell all userspace programs "I'm sorry, but
I'm suspending at the moment. Can you tip toe quietly around while I do
it?" You can't seriously expect every userspace program to be modified
to adjust its behaviour according to whether we're writing a snapshot
to disk at the moment or not.

It also implies that we can prepare a snapshot and then happily have the
contents of the disk change so that they don't match the superblock and
other filesystem details we just saved in the snapshot. We can't. At
least not without modifying all the filesystems so that (at a minimum)
they know how to throw away all the metadata they have at resume time
and reread it from disk.

> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > I was going to leave it at that and press send, but perhaps that
> > wouldn't be wise. I feel I should also ask what you're thinking of as a
> > means of making sure userspace doesn't do much activity.
>
> When the snapshot pages are COW, we will run out of memory if userspace
> writes to those pages too much. If userspace is blocked, say like
> displaying a "we are suspending" in X which blocks the user from using
> other programs that could generate new writes and mounting filesystems
> read-only, we don't need to worry about running out of memory.

This sounds feasible, but it's only really acceptable if you're willing to
have hibernation fail or restart multiple times. If your battery is
running out or you need to rush to put a lappy in your bag because the
train just came early, that's not an option. It's for that very reason
that I've put a lot of effort into trying to make it work first time,
every time. Not there yet, but it's a priority.

By the way, sorry. This email feels like it is pouring a lot of cold
water on your ideas. I don't want to be negative!

Regards,

Nigel


Pekka J Enberg

unread,
Apr 27, 2007, 3:00:12 AM4/27/07
to
On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> Sorry Pekka, but that's just broken.

It certainly isn't.

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> It implies firstly that we tell all userspace programs "I'm sorry, but
> I'm suspending at the moment. Can you tip toe quietly around while I do
> it?" You can't seriously expect every userspace program to be modified
> to adjust its behaviour according to whether we're writing a snapshot
> to disk at the moment or not.

You don't need to modify other programs. You just need to display the
progress bar and block _user input_. I don't even claim to know X, but I
would be extremely surprised if you technically can't say "don't let
the user touch any other windows except this one." The user couldn't care
less whether tasks are frozen or not by the kernel. What matters is that
the user can't shoot himself in the foot while snapshotting.

Furthermore, we probably do need to do other things to ensure safety, like
remounting filesystems read-only but again, this has nothing to do with
snapshotting per se. What the kernel needs to worry about is (1) providing
an atomic snapshot that is consistent and (2) resuming to that snapshot
safely. If the _user_ loses data that was generated between snapshot +
shutdown, it's absolutely no concern for the snapshot operation!

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> It also implies that we can prepare a snapshot and then happily have the
> contents of the disk change so that they don't match the superblock and
> other filesystem details we just saved in the snapshot. We can't. At
> least not without modifying all the filesystems so that (at a minimum)
> they know how to throw away all the metadata they have at resume time
> and reread it from disk.

But you just explained how we can! We shouldn't bend over backwards for
snapshotting just because the filesystems don't currently support
something we need.

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> By the way, sorry. This email feels like it is pouring a lot of cold
> water on your ideas. I don't want to be negative!

Don't worry, I am used to cold water :-).

Nigel Cunningham

Apr 27, 2007, 3:10:10 AM
Hi.

On Fri, 2007-04-27 at 09:50 +0300, Pekka J Enberg wrote:
> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > Sorry Pekka, but that's just broken.
>
> It certainly isn't.
>
> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > It implies firstly that we tell all userspace programs "I'm sorry, but
> > I'm suspending at the moment. Can you tip toe quietly around while I do
> > it?" You can't seriously expect every userspace program to be modified
> > to adjust its behaviour according to whether we're writing a snapshot
> > to disk at the moment or not.
>
> You don't need to modify other programs. You just need to display the
> progress bar and block _user input_. I don't even claim to know X, but I
> would be extremely surprised if you technically can't say "don't let
> the user touch any other windows except this one." The user couldn't care
> less whether tasks are frozen or not by the kernel. What matters is that
> the user can't shoot himself in the foot while snapshotting.

User input doesn't account for all system activity. Think of cron jobs
or user initiated jobs that may have started before the cycle began.

> Furthermore, we probably do need to do other things to ensure safety, like
> remounting filesystems read-only but again, this has nothing to do with
> snapshotting per se. What the kernel needs to worry about is (1) providing
> an atomic snapshot that is consistent and (2) resuming to that snapshot
> safely. If the _user_ loses data that was generated between snapshot +
> shutdown, it's absolutely no concern for the snapshot operation!

Noooo! If the user loses data, the user will be concerned and we should
be. I for one would do my best to avoid using software that loses my
data for me. I wouldn't care if you said "Well, it's your fault. You
lost the data." From my perspective as a user, I didn't lose the data,
some part of the computer's OS did.

> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > It also implies that we can prepare a snapshot and then happily have the
> > contents of the disk change so that they don't match the superblock and
> > other filesystem details we just saved in the snapshot. We can't. At
> > least not without modifying all the filesystems so that (at a minimum)
> > they know how to throw away all the metadata they have at resume time
> > and reread it from disk.
>
> But you just explained how we can! We shouldn't bend over backwards for
> snapshotting just because the filesystems don't currently support
> something we need.

Sorry, but I just don't believe filesystems should need to throw away
metadata post resume. If we let data be changed after snapshotting (or
ourselves cause it to be changed), we're the ones that are broken. Our
snapshot is out of date and the expectations of userspace programs that
were snapshotted will be out of date. Just imagine, for example, a
userspace program that is snapshotted, then reads and deletes a
temporary file. After the snapshot restore, it's running again. But
wait, we can't read or delete the file again because it's already gone.
Life just gets more complicated and confusing this way.

> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > By the way, sorry. This email feels like it is pouring a lot of cold
> > water on your ideas. I don't want to be negative!
>
> Don't worry, I am used to cold water :-).

Maybe, but I'd still rather be encouraging!

Nigel


Pekka J Enberg

Apr 27, 2007, 3:30:09 AM
On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> User input doesn't account for all system activity. Think of cron jobs
> or user initiated jobs that may have started before the cycle began.

Yes, but the _user_ did not start them so they didn't lose any work. See,
it might or might not be important but that's something the _userspace_
knows much more about than the kernel ever will.

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> Noooo! If the user loses data, the user will be concerned and we should
> be. I for one would do my best to avoid using software that loses my
> data for me. I wouldn't care if you said "Well, it's your fault. You
> lost the data." From my perspective as a user, I didn't lose the data,
> some part of the computer's OS did.

You are looking at snapshot/shutdown from the kernel and the user-experience
points of view at the same time, which causes confusion here.

Let me repeat: it is _absolutely no concern_ of the _kernel_ whether you
resume to a snapshot that does not contain all your precious data. The
kernel doesn't care one bit!

That being said, the _userspace solution_ obviously needs to take this
into account by blocking user input, making filesystems read-only, and
maybe even blocking certain background processes (cron and beagle indexing
come to mind).

It doesn't. We can either make the filesystem read-only or, surprise,
surprise, make a _snapshot_ of the filesystem!

And while the points you raised are important for the full
end-user solution, it is absolutely not interesting to snapshot_system().
The only thing it needs to guarantee is a consistent snapshot that we can
resume later.

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> Maybe, but I'd still rather be encouraging!

You are. Perhaps you just don't know it yet. ;-)

Pekka Enberg

Apr 27, 2007, 4:00:22 AM
On 4/26/07, Linus Torvalds <torv...@linux-foundation.org> wrote:
> In fact, I personally feel that I shouldn't even have merged
> userspace-swsusp, but if Andrew thinks it needs to be merged, my personal
> feelings simply don't matter that much. I have to trust people. But yes,
> as far as *I* am personally concerned, I think it was a mistake to merge
> it.

While the ioctl() interface is horrid, I think it's actually in
principle pretty close to your snapshot_system()/resume_snapshot().
The ugliness probably comes from the fact that suspend to RAM and
snapshot/shutdown are interleaved there too.

Oliver Neukum

Apr 27, 2007, 6:10:16 AM
On Friday, 27 April 2007 08:18, Pekka J Enberg wrote:
> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > COW is a possibility, but I understood (perhaps wrongly) that Linus was
> > thinking of a single syscall or such like to prepare the snapshot. If
> > you're going to start doing things like this, won't that mean you'd then
> > have to update/redo the snapshot or somehow nullify the effect of
> > anything the programs do so that doing it again after the snapshot is
> > restored doesn't cause problems?
>
> No. The snapshot is just that. A snapshot in time. From kernel point of
> view, it doesn't matter one bit when you did it or if the state has
> changed before you resume. It's up to userspace to make sure the user
> doesn't do real work while the snapshot is being written to disk and
> machine is shut down.

And where is the benefit in that? How is such user space freezing logic
simpler than having the kernel do the write?
What can you do in user space if all filesystems are r/o that is worth the
hassle?

Regards
Oliver

Christoph Hellwig

Apr 27, 2007, 7:33:00 AM
On Thu, Apr 26, 2007 at 05:38:07PM -0400, Theodore Tso wrote:
> On Fri, Apr 27, 2007 at 06:08:01AM +1000, Nigel Cunningham wrote:
> > We tried that. It would need some work. IIRC remounting filesystems
> > read-only makes files become marked read-only. Perfectly sensible,
> > except that if you then remount the filesystem rw at resume time, all
> > those files are still marked ro and userspace crashes and burns. Not
> > unfixable, I'll agree, but there is more work to do there.
>
> There are other solutions, though. One is that we could export a
> system call interface which freezes a filesystem and prevents any
> further I/O. We mostly have something like that right now (via the
> write_super_lockfs function in the superblock operations
> structure), but we haven't exported it to userspace.

It is exported on XFS ;-)

> We would also need a similar interface to freeze any block device I/O,
> in case you have a database running and doing direct I/O to a block
> device. (Or again, we could simply not support that case; how many
> people will be running a database accessing a block device on
> their laptop?)

Block device I/O uses generic_file*whateveriscurrenthere*_write, which
checks for the freeze flag, so the infrastructure for that is there
as well.
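
For readers following the freeze discussion, here is a minimal sketch of what a userspace-visible filesystem freeze can look like. It is not what was available when this thread was written: it assumes a later Linux kernel that provides the generic FIFREEZE and FITHAW ioctls in <linux/fs.h>. The helper name set_fs_frozen() is made up for this illustration, and the call needs CAP_SYS_ADMIN to succeed.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>     /* FIFREEZE, FITHAW (later kernels only) */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper: freeze (frozen != 0) or thaw the filesystem that
 * contains `mountpoint`. Returns 0 on success, -1 on error with errno set.
 */
static int set_fs_frozen(const char *mountpoint, int frozen)
{
    assert(mountpoint != NULL);

    int fd = open(mountpoint, O_RDONLY);
    if (fd < 0)
        return -1;

    /* While frozen, new writes to the filesystem block until it is thawed. */
    int rc = ioctl(fd, frozen ? FIFREEZE : FITHAW, 0);

    /* Preserve the ioctl's errno across close(). */
    int saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return rc;
}
```

A caller would freeze with set_fs_frozen("/mnt/data", 1) and thaw with set_fs_frozen("/mnt/data", 0); the path is only an example. Freezing the filesystem that holds the calling program's own files can leave the caller blocked, so this is best tried on a scratch mount.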

Daniel Pittman

Apr 27, 2007, 7:33:16 AM
Olivier Galibert <gali...@pobox.com> writes:
> On Fri, Apr 27, 2007 at 06:50:56AM +1000, Nigel Cunningham wrote:
>
>> I'm perfectly willing to think through some alternate approach if you
>> suggest something or prod my thinking in a new direction, but I'm
>> afraid I just can't see right now how we can achieve what you're
>> after.
>
> Ok, what about this approach I've been mulling about for a while:
>
> Suspend-to-disk is pretty much an exercise in state saving. There are
> multiple ways to do state saving, but they tend to end up in two
> categories: implicit and explicit.

[...]

> In explicit state saving each object saves what is needed from its
> state to an independently defined format (instead of "whatever the
> memory organization happens to be at that point"). When reloading the
> state you have to parse it, and it usually requires
> rebuilding/relocating all references/pointers/etc.

If you are looking seriously at this you might want to start with the
code in the OpenVZ kernel (http://openvz.org) that allows a VE to
"checkpoint" to disk and "restore" on the same or a different machine.

This is, as far as I can tell, a portable implementation of this that
already handles real live userspace applications moving transparently
between two machines.

It has the advantage that it lives in an orderly world where most
devices and the file system are virtual but, hey, it works right now.

Regards,
Daniel
--
Digital Infrastructure Solutions -- making IT simple, stable and secure
Phone: 0401 155 707 email: con...@digital-infrastructure.com.au
http://digital-infrastructure.com.au/

Pekka J Enberg

Apr 27, 2007, 7:34:36 AM
On Friday, 27 April 2007 08:18, Pekka J Enberg wrote:
> > No. The snapshot is just that. A snapshot in time. From kernel point of
> > view, it doesn't matter one bit when you did it or if the state has
> > changed before you resume. It's up to userspace to make sure the user
> > doesn't do real work while the snapshot is being written to disk and
> > machine is shut down.

On Fri, 27 Apr 2007, Oliver Neukum wrote:
> And where is the benefit in that? How is such user space freezing logic
> simpler than having the kernel do the write?
>
> What can you do in user space if all filesystems are r/o that is worth the
> hassle?

I am talking about snapshot_system() here. It's not given that the
filesystems need to be read-only (you can snapshot them too). The benefit
here is that you can do whatever you want with the snapshot (encrypt,
compress, send over the network) and have a clean well-defined interface
in the kernel. In addition, aborting the snapshot is simpler, simply
munmap() the snapshot.

The problem with writing in the kernel is obvious: we need to add new code
to the kernel for compression, encryption, and userspace interaction
(graphical progress bar) that are important for user experience.

Pekka
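
To make the userspace side of this proposal concrete, here is a minimal sketch under stated assumptions. snapshot_system() exists only as a proposal in this thread, so the code uses a hypothetical stand-in, fake_snapshot_system(), which maps an anonymous region filled with a fixed pattern. rle_compress() is a toy run-length encoder standing in for "compress in userspace", and abort_snapshot() shows that aborting amounts to munmap(). All three names are invented for illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical stand-in for the proposed snapshot_system(): returns a
 * mapping of a fake "image" and stores its length in *size. */
static void *fake_snapshot_system(uint32_t *size)
{
    assert(size != NULL);
    const uint32_t len = 4 * 4096;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    memset(p, 0xAB, len);           /* pretend this is the snapshot */
    *size = len;
    return p;
}

/* Toy run-length encoder: emits (count, byte) pairs with count <= 255.
 * Returns bytes written, or 0 if `out` is too small (or n is 0). */
static size_t rle_compress(const uint8_t *in, size_t n,
                           uint8_t *out, size_t cap)
{
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        uint8_t b = in[i];
        size_t run = 1;
        while (i + run < n && in[i + run] == b && run < 255)
            run++;
        if (o + 2 > cap)
            return 0;
        out[o++] = (uint8_t)run;
        out[o++] = b;
        i += run;
    }
    return o;
}

/* Aborting the snapshot is just dropping the mapping. */
static int abort_snapshot(void *image, uint32_t size)
{
    return munmap(image, size);
}
```

The point of the sketch is the division of labour argued for above: the kernel hands over one mapping, and everything done with it afterwards (compression here, but equally encryption or a network transfer) is ordinary userspace code.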

Pavel Machek

Apr 27, 2007, 9:00:14 AM
Hi!

> > * Doing things in the right order? (Prepare the image, then do the
> > atomic copy, then save).
>
> I'd actually like to discuss this a bit..
>
> I'm obviously not a huge fan of the whole user/kernel level split and
> interfaces, but I actually do think that there is *one* split that makes
> sense:
>
> - generate the (whole) snapshot image entirely inside the kernel
>
> - do nothing else (ie no IO at all), and just export it as a single image
> to user space (literally just mapping the pages into user space).
> *one* interface. None of the "pretty UI update" crap. Just a single
> system call:
>
> void *snapshot_system(u32 *size);
>
> which will map in the snapshot, return the mapped address and the size
> (and if you want to support snapshots > 4GB, be my guest, but I suspect
> you're actually *better* off just admitting that if you cannot shrink
> the snapshot to less than 32 bits, it's not worth doing)

I think this is very similar to current uswsusp design; except that we
are using read on /dev/snapshot to read the snapshot (not memory
mapping) and that we freeze the system (because I do not think killall
_SIGSTOP is enough).

Can you confirm that it is indeed similar design, or tell me why I'm
wrong? You had some pretty strong words for uswsusp before, so I'd
like to understand your position here. ("Ouch, I do not know, I am out
of time" is still a better reply than silence.)

Pavel Machek

Apr 27, 2007, 11:00:18 AM
On Fri 2007-04-27 08:41:56, Pekka Enberg wrote:
> On 4/27/07, Pavel Machek <pa...@ucw.cz> wrote:
> >Now, it would be _very_ nice to be able to snapshot system and
> >continue running, but I just don't see how to do it without extensive
> >filesystem support.
>
> So what kind of support do we need from the filesystem?

"forcedremount ro, not telling anyone, not killing processes" would do
the trick. FS snapshots might do.

Oliver Neukum

Apr 27, 2007, 3:20:06 PM
On Friday, 27 April 2007 12:12, Pekka J Enberg wrote:
> I am talking about snapshot_system() here. It's not given that the
> filesystems need to be read-only (you can snapshot them too). The benefit
> here is that you can do whatever you want with the snapshot (encrypt,
> compress, send over the network) and have a clean well-defined interface
> in the kernel. In addition, aborting the snapshot is simpler, simply
> munmap() the snapshot.

But is that worth the trade off?

> The problem with writing in the kernel is obvious: we need to add new code
> to the kernel for compression, encryption, and userspace interaction
> (graphical progress bar) that are important for user experience.

The kernel can already do compression and encryption.

Regards
Oliver

Rafael J. Wysocki

Apr 27, 2007, 5:30:15 PM
On Friday, 27 April 2007 06:52, Pekka J Enberg wrote:
> On Thu, 2007-04-26 at 09:56 -0700, Linus Torvalds wrote:
> > > which will map in the snapshot, return the mapped address and the size
> > > (and if you want to support snapshots > 4GB, be my guest, but I suspect
> > > you're actually *better* off just admitting that if you cannot shrink
> > > the snapshot to less than 32 bits, it's not worth doing)
>
> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > That inherently limits the image to half of available ram (you need
> > somewhere to store the snapshot), so you won't get the full image you
> > express interest in below.
>
> It doesn't. We can make the userspace mapped pages copy-on-write. As long
> as the userspace makes sure there's not much activity during
> snapshot/shutdown, we will be fine. What we probably do need to copy is
> kernel pages.

The user space is (and IMHO should be) frozen way before that and what you're
suggesting here is what I wanted to implement some time ago. The problem with
this was that the user space pages may be updated, for example, by device
drivers as a result of some deferred I/O after we've snapshotted the system.

I didn't know how to find out which pages owned by the user space could be
updated this way, so I gave up at that time.

Greetings,
Rafael
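
The copy-on-write idea and its memory cost can be modelled in a few lines. This is a toy userspace simulation, not kernel code: the struct and function names (cow_snapshot, cow_write and so on) and the page counts are all invented for illustration. Each first write to a page after the snapshot costs one extra page, which is why heavy activity after the snapshot can exhaust memory.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SIM_PAGE_SIZE 4096
#define SIM_NPAGES    8

/* Hypothetical model: live[] is running memory; saved[] holds private
 * copies made only for pages written after the snapshot was taken. */
struct cow_snapshot {
    unsigned char *live[SIM_NPAGES];
    unsigned char *saved[SIM_NPAGES];
    size_t copies;              /* extra pages the snapshot has cost so far */
};

static void cow_free(struct cow_snapshot *s)
{
    for (int i = 0; i < SIM_NPAGES; i++) {
        free(s->live[i]);
        free(s->saved[i]);
        s->live[i] = s->saved[i] = NULL;
    }
}

/* Allocate zeroed live pages; taking the "snapshot" itself copies nothing. */
static int cow_init(struct cow_snapshot *s)
{
    memset(s, 0, sizeof *s);
    for (int i = 0; i < SIM_NPAGES; i++) {
        s->live[i] = calloc(1, SIM_PAGE_SIZE);
        if (!s->live[i]) {
            cow_free(s);
            return -1;
        }
    }
    return 0;
}

/* Write one byte after the snapshot. The first write to a page preserves
 * the old contents first; failure here is the out-of-memory case. */
static int cow_write(struct cow_snapshot *s, int page, size_t off,
                     unsigned char v)
{
    assert(page >= 0 && page < SIM_NPAGES && off < SIM_PAGE_SIZE);
    if (!s->saved[page]) {
        s->saved[page] = malloc(SIM_PAGE_SIZE);
        if (!s->saved[page])
            return -1;
        memcpy(s->saved[page], s->live[page], SIM_PAGE_SIZE);
        s->copies++;
    }
    s->live[page][off] = v;
    return 0;
}

/* Snapshot view of a page: the saved copy if modified, else the live page. */
static const unsigned char *cow_snapshot_page(const struct cow_snapshot *s,
                                              int page)
{
    return s->saved[page] ? s->saved[page] : s->live[page];
}
```

The model only protects pages whose writes go through cow_write(). The objection raised in this message, deferred I/O landing in user pages via device drivers, corresponds to writes that bypass that path, so the snapshot view would silently change.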

Rafael J. Wysocki

Apr 27, 2007, 5:30:18 PM
On Friday, 27 April 2007 08:18, Pekka J Enberg wrote:
> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > COW is a possibility, but I understood (perhaps wrongly) that Linus was
> > thinking of a single syscall or such like to prepare the snapshot. If
> > you're going to start doing things like this, won't that mean you'd then
> > have to update/redo the snapshot or somehow nullify the effect of
> > anything the programs do so that doing it again after the snapshot is
> > restored doesn't cause problems?
>
> No. The snapshot is just that. A snapshot in time. From kernel point of
> view, it doesn't matter one bit when you did it or if the state has
> changed before you resume. It's up to userspace to make sure the user
> doesn't do real work while the snapshot is being written to disk and
> machine is shut down.

Why do you think that keeping the user space frozen after 'snapshot' is a bad
idea? I think that solves many of the problems you're discussing.

Greetings,
Rafael

Rafael J. Wysocki

Apr 27, 2007, 5:30:22 PM
On Friday, 27 April 2007 14:49, Pavel Machek wrote:
> Hi!
>
> > > * Doing things in the right order? (Prepare the image, then do the
> > > atomic copy, then save).
> >
> > I'd actually like to discuss this a bit..
> >
> > I'm obviously not a huge fan of the whole user/kernel level split and
> > interfaces, but I actually do think that there is *one* split that makes
> > sense:
> >
> > - generate the (whole) snapshot image entirely inside the kernel
> >
> > - do nothing else (ie no IO at all), and just export it as a single image
> > to user space (literally just mapping the pages into user space).
> > *one* interface. None of the "pretty UI update" crap. Just a single
> > system call:
> >
> > void *snapshot_system(u32 *size);
> >
> > which will map in the snapshot, return the mapped address and the size
> > (and if you want to support snapshots > 4GB, be my guest, but I suspect
> > you're actually *better* off just admitting that if you cannot shrink
> > the snapshot to less than 32 bits, it's not worth doing)
>
> I think this is very similar to current uswsusp design; except that we
> are using read on /dev/snapshot to read the snapshot (not memory
> mapping) and that we freeze the system

Yes, it seems so.

> (because I do not think killall _SIGSTOP is enough).

Agreed.

Greetings,
Rafael

Nigel Cunningham

Apr 27, 2007, 5:50:10 PM
Hi.

On Fri, 2007-04-27 at 16:55 +0200, Pavel Machek wrote:
> On Fri 2007-04-27 08:41:56, Pekka Enberg wrote:
> > On 4/27/07, Pavel Machek <pa...@ucw.cz> wrote:
> > >Now, it would be _very_ nice to be able to snapshot system and
> > >continue running, but I just don't see how to do it without extensive
> > >filesystem support.
> >
> > So what kind of support do we need from the filesystem?
>
> "forcedremount ro, not telling anyone, not killing processes" would do
> the trick. FS snapshots might do.

It sounds to me more like Pekka is thinking of checkpointing support. If
that's the case, then remounting filesystems isn't going to be an
option. You want to freeze them for just long enough so that you can
determine what needs saving in the checkpoint. You certainly don't want
to make rw file handles ro and so on.

Nigel


Linus Torvalds

Apr 27, 2007, 5:50:14 PM

On Fri, 27 Apr 2007, Rafael J. Wysocki wrote:
>
> Why do you think that keeping the user space frozen after 'snapshot' is a bad
> idea? I think that solves many of the problems you're discussing.

It makes it harder to debug (wouldn't it be *nice* to just ssh in, and do

gdb -p <snapshotter>

when something goes wrong?) but we also *depend* on user space for various
things (the same way we depend on kernel threads, and why it has been such
a total disaster to try to freeze the kernel threads too!). For example,
if you want to do graphical stuff, just using X would be quite nice,
wouldn't it?

But I do agree that doing everything in the kernel is likely to just be a
hell of a lot simpler for everybody.

Linus

Nigel Cunningham

Apr 27, 2007, 6:10:08 PM
Hi.

On Fri, 2007-04-27 at 14:44 -0700, Linus Torvalds wrote:
>
> On Fri, 27 Apr 2007, Rafael J. Wysocki wrote:
> >
> > Why do you think that keeping the user space frozen after 'snapshot' is a bad
> > idea? I think that solves many of the problems you're discussing.
>
> It makes it harder to debug (wouldn't it be *nice* to just ssh in, and do
>
> gdb -p <snapshotter>

Make the machine being suspended a VM and you can already do that.

> when something goes wrong?) but we also *depend* on user space for various
> things (the same way we depend on kernel threads, and why it has been such
> a total disaster to try to freeze the kernel threads too!). For example,
> if you want to do graphical stuff, just using X would be quite nice,
> wouldn't it?

It would be nice, yes.

But in doing so you make the contents of the disk inconsistent with the
state you've just snapshotted, leading to filesystem corruption. Even if
you modify filesystems to do checkpointing (which is what we're really
talking about), you still also have the problem that your snapshot has
to be stored somewhere before you write it to disk, so you also have to
either

1) write some known static memory to disk before the snapshot and reuse
it for the snapshot,
2) ensure up to half the RAM is free for your snapshot,
3) compress the snapshot as you take it, guessing beforehand how much
memory the compressed snapshot might take and freeing that much, or
4) reserve memory at boot time for the atomic copy so that 2) or 3) is
still done, but without having to free the memory. (Yuk!).

> But I do agree that doing everything in the kernel is likely to just be a
> hell of a lot simpler for everybody.

Indeed.

Nigel


Rafael J. Wysocki

Apr 27, 2007, 6:10:12 PM
On Friday, 27 April 2007 23:44, Linus Torvalds wrote:
>
> On Fri, 27 Apr 2007, Rafael J. Wysocki wrote:
> >
> > Why do you think that keeping the user space frozen after 'snapshot' is a bad
> > idea? I think that solves many of the problems you're discussing.
>
> It makes it harder to debug (wouldn't it be *nice* to just ssh in, and do
>
> gdb -p <snapshotter>
>
> when something goes wrong?) but we also *depend* on user space for various
> things (the same way we depend on kernel threads, and why it has been such
> a total disaster to try to freeze the kernel threads too!).

We're freezing many of them just fine. ;-)

> For example, if you want to do graphical stuff, just using X would be quite
> nice, wouldn't it?

Yes, it would, but as long as we can't protect mounted filesystems from being
touched, it's just dangerous to let the user space run at that point.

> But I do agree that doing everything in the kernel is likely to just be a
> hell of a lot simpler for everybody.

:-)

Greetings,
Rafael

Linus Torvalds

Apr 27, 2007, 6:20:07 PM

On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
>
> We're freezing many of them just fine. ;-)

And can you name a _single_ advantage of doing so?

It so happens, that most people wouldn't notice or care that kmirrord got
frozen (kernel thread picked at random - it might be one of the threads
that has gotten special-cased to not do that), but I have yet to hear a
single coherent explanation for why it's actually a good idea in the first
place.

And it has added totally idiotic code to every single kernel thread main
loop. For _no_ reason, except that the concept was broken, and needed more
breakage to just make it work.

Linus

Rafael J. Wysocki

Apr 27, 2007, 6:40:15 PM
On Saturday, 28 April 2007 00:08, Linus Torvalds wrote:
>
> On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> >
> > We're freezing many of them just fine. ;-)
>
> And can you name a _single_ advantage of doing so?

Yes. We have far fewer interdependencies to worry about during the whole
operation.

> It so happens, that most people wouldn't notice or care that kmirrord got
> frozen (kernel thread picked at random - it might be one of the threads
> that has gotten special-cased to not do that), but I have yet to hear a
> single coherent explanation for why it's actually a good idea in the first
> place.

Well, I don't know if that's a 'coherent' explanation from your point of view
(probably not), but I'll try nevertheless:
1) if the kernel threads are frozen, we know that they don't hold any locks
that could interfere with the freezing of device drivers,
2) if they are frozen, we know, for example, that they won't call user mode
helpers or do similar things,
3) if they are frozen, we know that they won't submit I/O to disks and
potentially damage filesystems (suspend2 has many more problems with that
than swsusp, but still. And yes, there have been bug reports related to it,
so it's not just my fantasy).

Probably some other people can say more about it.

> And it has added totally idiotic code to every single kernel thread main
> loop. For _no_ reason, except that the concept was broken, and needed more
> breakage to just make it work.

It is actually useful for some things other than the hibernation/suspend, the
code is not idiotic (it's one line of code in the majority of cases) and you
should take that "I hate everything even remotely related to hibernation" hat
off, really.

Greetings,
Rafael
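
The "one line of code" mentioned above is the freeze check at the top of a kernel thread's main loop (try_to_freeze() in the kernel). Below is a single-threaded userspace model of that pattern, offered only as a sketch: struct worker, model_try_to_freeze() and worker_step() are invented names, and a real thread would sleep inside the check rather than return from it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one kernel thread. */
struct worker {
    bool freeze_requested;  /* set by the suspend side */
    bool frozen;            /* set by the worker once it has parked */
    int  work_done;         /* iterations that actually did work */
};

/* Models the freeze check: park if a freeze was requested.
 * Returns true while the worker is frozen. */
static bool model_try_to_freeze(struct worker *w)
{
    assert(w != NULL);
    w->frozen = w->freeze_requested;
    return w->frozen;
}

/* One iteration of the thread's main loop. */
static void worker_step(struct worker *w)
{
    if (model_try_to_freeze(w))
        return;     /* frozen: takes no locks and submits no I/O */
    w->work_done++; /* normal work happens only when not frozen */
}
```

The three guarantees listed in this message follow from where the check sits: a thread parked at the top of its loop is, by construction, not in the middle of holding a lock, calling a user mode helper, or submitting I/O.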