
swsusp performance problems in 2.6.15-rc3-mm1


Andy Isaacson

Dec 5, 2005, 3:30:13 AM
On recent kernels such as 2.6.14-rc2-mm1, a swsusp of my laptop (1.25
GB, P4M 1.4 GHz) was a pretty fast process; freeing memory took about 3
seconds or less, and writing out the swap image took less than 5
seconds, so within 15 seconds of running my suspend script power was
off.

The downside was that after suspend, *everything* needed to be paged
back in, so all my apps were *very* slow for the first few interactions.
It would take about 15 or 20 seconds for Firefox to repaint the first
time I switched to its virtual desktop, and it was perceptibly slower
than normal for the next 5 or 10 minutes of use.

Now that I'm running 2.6.15-rc3-mm1, the page-in problem seems to be
largely gone; I don't notice a significant lagginess after resuming from
swsusp.

But the suspend process is *slow*. It takes a good 20 or 30 seconds to
write out the image, which makes the overall suspend process take close
to a minute; it's writing about 400 MB, and my disk seems to only be
good for about 18 MB/sec according to hdparm -t.

And the resume is slower by about the same amount, too.

Certainly there's a tradeoff to be made, and I'm glad to lose the slow
re-paging after resume, but I'm hoping that some kind of improvement can
be made in the suspend/resume time.

Could we perhaps throw away *half* the cached memory rather than all of
it? Or keep a lazy list of pages that need re-reading and page them in
asynchronously after restarting userland?

-andy
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Pavel Machek

Dec 5, 2005, 7:30:13 AM
Hi!

> On recent kernels such as 2.6.14-rc2-mm1, a swsusp of my laptop (1.25
> GB, P4M 1.4 GHz) was a pretty fast process; freeing memory took about 3
> seconds or less, and writing out the swap image took less than 5
> seconds, so within 15 seconds of running my suspend script power was
> off.

So suspend took 15 seconds, and boot another 5 to read the image + 20
the first time desktops are switched. ... ~40 seconds total.

> The downside was that after suspend, *everything* needed to be paged
> back in, so all my apps were *very* slow for the first few interactions.
> It would take about 15 or 20 seconds for Firefox to repaint the first
> time I switched to its virtual desktop, and it was perceptibly slower
> than normal for the next 5 or 10 minutes of use.
>
> Now that I'm running 2.6.15-rc3-mm1, the page-in problem seems to be
> largely gone; I don't notice a significant lagginess after resuming from
> swsusp.
>
> But the suspend process is *slow*. It takes a good 20 or 30 seconds to
> write out the image, which makes the overall suspend process take close
> to a minute; it's writing about 400 MB, and my disk seems to only be
> good for about 18 MB/sec according to hdparm -t.

Let's say 20 seconds suspend, plus 20 seconds resume, and no time
needed to switch the desktops. So it is ~40 seconds total, again ;-).

> And, the resume is about the same amount slower, too.
>
> Certainly there's a tradeoff to be made, and I'm glad to lose the slow
> re-paging after resume, but I'm hoping that some kind of improvement can
> be made in the suspend/resume time.

Of course, there are many ways to improve suspend. Some are easy, some
are hard, some can be merged, and some can not.

> Could we perhaps throw away *half* the cached memory rather than all of
> it?

Should be easy, mergeable and possibly very effective. Relevant code
is in kernel/power/disk.c.

> Or keep a lazy list of pages that need re-reading and page them in
> asynchronously after restarting userland?

This would be fine if you can do it in userspace, but it is not going
to be so easy... actually, there's one entry in the FAQ:

Q: After resuming, the system is paging heavily, leading to very bad
interactivity.

A: Try running

cat `cat /proc/[0-9]*/maps | grep / | sed 's:.* /:/:' | sort -u` > /dev/null

after resume. swapoff -a; swapon -a may also be useful.

...does that help for you?

Other possible ideas are:

* get suspend to RAM working if you want it *really* fast :-)

* compress the image. Needs to be done in userspace, so it needs
uswsusp to be merged, first. Patches for that are available. Should
speed it up about twice.

* and of course you can apply one very big patch and do all of the
above :-).
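The "about twice" estimate is easy to sanity-check with back-of-the-envelope arithmetic. The model below is hypothetical and assumes the compressor itself is not the bottleneck (which, as discussed later in the thread, does not hold for deflate):

```c
#include <assert.h>

/* Hypothetical back-of-the-envelope model: if a fast compressor shrinks
 * the image to about half its size (ratio ~0.5) and the disk is the
 * bottleneck, write time halves. With a slow compressor the compressor
 * itself becomes the bottleneck and this model no longer applies. */
static double write_seconds(double image_mb, double ratio, double disk_mb_s)
{
    return image_mb * ratio / disk_mb_s;   /* bytes actually written / disk rate */
}
```

With Andy's numbers (400 MB at 18 MB/s), that gives roughly 22 seconds uncompressed and 11 seconds at a 0.5 ratio.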

Pavel
--
Thanks, Sharp!

Nigel Cunningham

Dec 5, 2005, 9:10:25 AM
Hi.

On Mon, 2005-12-05 at 22:17, Pavel Machek wrote:
> Hi!
>
> > On recent kernels such as 2.6.14-rc2-mm1, a swsusp of my laptop (1.25
> > GB, P4M 1.4 GHz) was a pretty fast process; freeing memory took about 3
> > seconds or less, and writing out the swap image took less than 5
> > seconds, so within 15 seconds of running my suspend script power was
> > off.
>
> So suspend took 15 second, and boot another 5 to read the image + 20
> first time desktops are switched. ... ~40 second total.

Plus what is mentioned in the next paragraph.

That's not comparing apples with apples, though. If you have a hopeless
battery, as many do, suspend to RAM is only good if you're moving from
one power point to another.

> * compress the image. Needs to be done in userspace, so it needs
> uswsusp to be merged, first. Patches for that are available. Should
> speed it up about twice.

That's not true at all. You have cryptoapi in kernel space and can
easily use it - it's very similar code to what you already have for
encryption. You won't get double the speed with the deflate
compressor - more like 2 or 3MB/s :(. Suspend2 gets double the speed
because we use lzf, which is a logically distinct addition
(implemented now as another cryptoapi plugin).

> * and of course you can apply one very big patch and do all of the
> above :-).

Could you stop being nasty, please?

Yes, suspend2 is bigger, but let's keep things in perspective. Including
comments and so on, it's about 12000 lines. fs/ext3 contains 15000 lines
and fs/xfs is just below 115000 lines. For those 12000 lines you get a
clean internal API, support for compression, encryption, swap
partitions, swap files and ordinary files. You get asynchronous I/O and
read ahead where I/O needs to be synchronous. You get saving a full
image of memory and support for a nice user interface (mostly in
userspace). It's not 12000 lines of bloat, but real functionality that
people are using right now.

Talking about suspend2 like it's just bloatware is unfair and does
nothing to help provide users of Linux with a suspend to disk that's as
good as it can be.

Nigel

Pavel Machek

Dec 5, 2005, 12:40:23 PM
Hi!

[BTW the right function to modify is swsusp_shrink_memory; it is quite
clear what it is doing, so finding a formula that frees enough to make
it fast but not so much as to make it unresponsive should be easy.]
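As a purely illustrative userspace sketch of such a formula (the function name, the half-of-RAM limit and the 500MB cap are all hypothetical, borrowed from numbers discussed in this thread):

```c
#include <assert.h>

/* Illustrative sketch only -- not kernel code. Aim for an image that
 * keeps as much of the working set as possible, but never more than
 * half of RAM (it has to fit in free memory) and never more than a
 * fixed cap, so the write stays fast. */
#define CAP_MB 500

static unsigned int target_image_mb(unsigned int ram_mb, unsigned int used_mb)
{
    unsigned int limit = ram_mb / 2;   /* image must fit in free RAM */
    if (limit > CAP_MB)
        limit = CAP_MB;                /* bound the amount written to disk */
    return used_mb < limit ? used_mb : limit;
}
```

swsusp_shrink_memory() would then free (used - target) MB before writing the image.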

> > Other possible ideas are:
> >
> > * get suspend to RAM working if you want it *really* fast :-)
>
> That's not apples with apples though. If you have a hopeless battery, as
> many do, suspend to ram is only good if you're moving from one power
> point to another.

Yes, it is not completely fair. But as I started to use an X32 with a
good battery... well, I'm not really using swsusp any more.

> > * compress the image. Needs to be done in userspace, so it needs
> > uswsusp to be merged, first. Patches for that are available. Should
> > speed it up about twice.
>
> That's not true at all. You have cryptoapi in kernel space and can
> easily use it - it's very similar code to what you already have for
> encryption. You won't get double the speed with the deflate
> compressor - more like 2 or 3MB/s :(. Suspend2 gets double the speed
> because we use lzf, which is a logically distinct addition
> (implemented now as another cryptoapi plugin).

Well, a 3MB/sec improvement will save him seconds out of 20, or
something like that, so I guess LZF *is* the way to go, and I'd like to
keep that out of the kernel. And I will not accept compression into
mainline swsusp; I did that experiment with encryption once already,
and I did not like the result much.

If the goal is "make it work with least effort", the answer is of
course suspend2; but I need someone to help me do it right.

> > * and of course you can apply one very big patch and do all of the
> > above :-).
>
> Could you stop being nasty, please?

Sorry, I was trying to be funny.
Pavel
--
Boycott Kodak -- for their patent abuse against Java.

Nigel Cunningham

Dec 5, 2005, 4:20:12 PM
Hi.

On Tue, 2005-12-06 at 03:29, Pavel Machek wrote:
> > That's not apples with apples though. If you have a hopeless battery, as
> > many do, suspend to ram is only good if you're moving from one power
> > point to another.
>
> Yes, it is not completely fair. But as I started to use X32 with good
> battery... well I'm not really using swsusp any more.

Ah. Personally, I quite like standby mode too - no need to VGA POST at
all. But my lappy's battery is one of those hopeless ones, so it won't
do for long. :(

> > > * compress the image. Needs to be done in userspace, so it needs
> > > uswsusp to be merged, first. Patches for that are available. Should
> > > speed it up about twice.
> >
> > That's not true at all. You have cryptoapi in kernel space and can
> > easily use it - it's very similar code to what you already have for
> > encryption. You won't get double the speed with the deflate
> > compressor - more like 2 or 3MB/s :(. Suspend2 gets double the speed
> > because we use lzf, which is a logically distinct addition
> > (implemented now as another cryptoapi plugin).
>
> Well, a 3MB/sec improvement will save him seconds out of 20, or
> something like that, so I guess LZF *is* the way to go, and I'd like
> to keep that out of the kernel. And I will not accept compression into
> mainline swsusp; I did that experiment with encryption once already,
> and I did not like the result much.

No - I didn't mean a 3MB/s improvement. I meant that you'll get about
3MB/s throughput. It's _very_ slow. Of course having said that, I don't
recall now what machine I saw that on. It may well be my 933 Celeron
(Omnibook XE3).

> If goal is "make it work with least effort", answer is of course
> suspend2; but I need someone to help me doing it right.

How do you think suspend2 does it wrong? Is it just that you think that
everything belongs in userspace, or is there more to it?

> > > * and of course you can apply one very big patch and do all of the
> > > above :-).
> >
> > Could you stop being nasty, please?
>
> Sorry, I was trying to be funny.

Ok. Sorry if I over-reacted.

Regards,

Nigel

Rafael J. Wysocki

Dec 5, 2005, 6:30:06 PM
Hi,

On Monday, 5 December 2005 14:58, Nigel Cunningham wrote:
> On Mon, 2005-12-05 at 22:17, Pavel Machek wrote:
> > > On recent kernels such as 2.6.14-rc2-mm1, a swsusp of my laptop (1.25
> > > GB, P4M 1.4 GHz) was a pretty fast process; freeing memory took about 3
> > > seconds or less, and writing out the swap image took less than 5
> > > seconds, so within 15 seconds of running my suspend script power was
> > > off.
> >
> > So suspend took 15 second, and boot another 5 to read the image + 20
> > first time desktops are switched. ... ~40 second total.
>
> Plus what is mentioned in the next paragraph.

Indeed. Yet, the point has been made and backed up with some numbers:
there's at least one swsusp user (Andy) who would apparently _prefer_
that more memory were freed during suspend. The reason is the amount of
RAM in Andy's box.

}-- snip --{

> > * and of course you can apply one very big patch and do all of the
> > above :-).
>
> Could you stop being nasty, please?
>
> Yes, suspend2 is bigger, but let's keep things in perspective. Including
> comments and so on, it's about 12000 lines. fs/ext3 contains 15000 lines
> and fs/xfs is just below 115000 lines. For those 12000 lines you get a
> clean internal api, support for compression, encryption, swap
> partitions, swap files and ordinary files. You get asynchronous I/O and
> read ahead where I/O needs to be synchronous. You get saving a full
> image of memory and support for a nice user interface (mostly in
> userspace). It's not 12000 lines of bloat, but real functionality that
> people are using right now.

Let me say I think you're doing a great job with maintaining suspend2.
It looks like a really difficult thing to do, particularly in recent times.
Moreover, you have solved many very difficult problems and I respect
that very much. Still, I don't agree with some points you are making.

First, I don't think that saving a full image of memory is generally a good
idea. It is - for systems with relatively small RAM, but for systems with
more than, say, 512 MB that's questionable. Of course that depends a lot
on the usage patterns of particular system, but having 768 MB of RAM
in my box I wouldn't like it to save more than a half of it during suspend,
for performance reasons.

Second, IMHO, some things you are doing in suspend2, like image encryption,
or accessing ordinary files, should not be implemented in the kernel.

That said, I think at least some of the functionality you have already
implemented in suspend2 is needed and generally can be shared between
your code and swsusp. I've been meaning to look for such possibilities
for some time, but unfortunately, in its downloadable form, your patch
is quite difficult to follow, so if you have a version that is organized
in a more functionality-oriented way, could you please point me to it?

Greetings,
Rafael


--
Beer is proof that God loves us and wants us to be happy - Benjamin Franklin

Rafael J. Wysocki

Dec 5, 2005, 6:30:09 PM
Hi,

On Monday, 5 December 2005 18:29, Pavel Machek wrote:
> [BTW right function to modify is swsusp_shrink_memory, it is quite
> clear what it is doing, so finding formula that frees enough to make
> it fast but not so much to make it unresponsive should be easy.]

Agreed.

}-- snip --{


> Yes, it is not completely fair. But as I started to use X32 with good
> battery... well I'm not really using swsusp any more.

Unfortunately I can't make my box suspend to RAM ... :-(

> > > * compress the image. Needs to be done in userspace, so it needs
> > > uswsusp to be merged, first. Patches for that are available. Should
> > > speed it up about twice.

Frankly, I would think that compression could be done in the kernel.
Encryption is different, as it should depend on a user-provided key, but
compression does not seem to depend on anything "external", at first sight.

}-- snip --{



> If goal is "make it work with least effort", answer is of course
> suspend2; but I need someone to help me doing it right.

Well, in Andy's case this may or may not help. Actually I'd like him to
try it and say what the result is, but only if he's so kind, has some
free time to was^H^H^Hdo this, etc. ;-)

Greetings,
Rafael


--
Beer is proof that God loves us and wants us to be happy - Benjamin Franklin


Rafael J. Wysocki

Dec 5, 2005, 6:30:14 PM
Hi,

On Monday, 5 December 2005 13:17, Pavel Machek wrote:
> > On recent kernels such as 2.6.14-rc2-mm1, a swsusp of my laptop (1.25
> > GB, P4M 1.4 GHz) was a pretty fast process; freeing memory took about 3
> > seconds or less,

That is strange. Without the recent patches it takes _much_ more time
on my box, and I have "only" 768 MB of RAM.

> > and writing out the swap image took less than 5
> > seconds, so within 15 seconds of running my suspend script power was
> > off.
>
> So suspend took 15 second, and boot another 5 to read the image + 20
> first time desktops are switched. ... ~40 second total.
>
> > The downside was that after suspend, *everything* needed to be paged
> > back in, so all my apps were *very* slow for the first few interactions.
> > It would take about 15 or 20 seconds for Firefox to repaint the first
> > time I switched to its virtual desktop, and it was perceptibly slower
> > than normal for the next 5 or 10 minutes of use.

That's a lot, IMHO.

> > Now that I'm running 2.6.15-rc3-mm1, the page-in problem seems to be
> > largely gone; I don't notice a significant lagginess after resuming from
> > swsusp.
> >
> > But the suspend process is *slow*. It takes a good 20 or 30 seconds to
> > write out the image, which makes the overall suspend process take close
> > to a minute; it's writing about 400 MB, and my disk seems to only be
> > good for about 18 MB/sec according to hdparm -t.
>
> Lets say 20 seconds suspend, plus 20 seconds resume, and no time
> needed to switch the desktops. So it is ~40 seconds total, again ;-).

I think there's no point in doing such calculations. In fact, above a
certain critical RAM size (call it X), the more RAM in the box, the
_worse_ it gets when we try to free as little memory as possible (let
alone trying to save _all_ of it). The only question is how large X is.

Andy's numbers suggest X is approximately 1.5 GB, if I remember his
dmesg outputs correctly. Therefore, even if the current code is as
effective for him as the old one, it _will_ _not_ be so for someone who
has, say, 2 GB of RAM or more. IOW, there is a point at which it
becomes reasonable to free memory before suspend for performance
reasons, and it only remains uncertain where that point actually is and
how much memory should be freed for a given RAM size.

> > And, the resume is about the same amount slower, too.
> >
> > Certainly there's a tradeoff to be made, and I'm glad to lose the slow
> > re-paging after resume, but I'm hoping that some kind of improvement can
> > be made in the suspend/resume time.
>
> Of course, there are many ways to improve suspend. Some are easy, some
> are hard, some can be merged, and some can not.
>
> > Could we perhaps throw away *half* the cached memory rather than all of
> > it?
>
> Should be easy, mergeable and possibly very effective. Relevant code
> is in kernel/power/disk.c.

For this purpose we'll need to tamper with mm, I think, and that won't be
easy.

OTOH, we can get a similar result by just making the kernel free some
more memory _after_ we are sure we have enough memory to suspend.
IOW, after the code that's currently in swsusp_shrink_memory() has
finished, we can try to free some "extra" memory to improve
performance, if needed. The question is how much "extra" memory should
be freed, and I'm afraid it will have to be tuned on a per-system, or
at least per-RAM-size, basis.
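A hypothetical per-RAM-size table for that "extra" amount might look like this (the breakpoints are invented purely for illustration):

```c
#include <assert.h>

/* Invented illustration of per-RAM-size tuning: after the minimum has
 * been freed, free some extra memory scaled by total RAM. Small boxes
 * keep everything; larger boxes trim progressively harder. */
static unsigned int extra_free_mb(unsigned int ram_mb)
{
    if (ram_mb <= 512)
        return 0;                          /* small RAM: free nothing extra */
    if (ram_mb <= 1024)
        return (ram_mb - 512) / 2;         /* mid-size: free half the excess */
    return 256 + (ram_mb - 1024) / 4;      /* large RAM: taper off */
}
```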

I think I can write an experimental patch for that, if Andy agrees to test
it. ;-)

Greetings,
Rafael


--
Beer is proof that God loves us and wants us to be happy - Benjamin Franklin


Rafael J. Wysocki

Dec 5, 2005, 6:30:15 PM
Hi,

On Monday, 5 December 2005 09:19, Andy Isaacson wrote:
> On recent kernels such as 2.6.14-rc2-mm1, a swsusp of my laptop (1.25
> GB, P4M 1.4 GHz) was a pretty fast process; freeing memory took about 3
> seconds or less,

It took much more time on my box, but I won't argue with your
experience. ;-)

> and writing out the swap image took less than 5
> seconds, so within 15 seconds of running my suspend script power was
> off.
>
> The downside was that after suspend, *everything* needed to be paged
> back in, so all my apps were *very* slow for the first few interactions.
> It would take about 15 or 20 seconds for Firefox to repaint the first
> time I switched to its virtual desktop, and it was perceptibly slower
> than normal for the next 5 or 10 minutes of use.
>
> Now that I'm running 2.6.15-rc3-mm1, the page-in problem seems to be
> largely gone; I don't notice a significant lagginess after resuming from
> swsusp.
>
> But the suspend process is *slow*. It takes a good 20 or 30 seconds to
> write out the image, which makes the overall suspend process take close
> to a minute; it's writing about 400 MB, and my disk seems to only be
> good for about 18 MB/sec according to hdparm -t.
>
> And, the resume is about the same amount slower, too.
>
> Certainly there's a tradeoff to be made, and I'm glad to lose the slow
> re-paging after resume, but I'm hoping that some kind of improvement can
> be made in the suspend/resume time.

Yes, there is a tradeoff. Till now, we have used the simplistic
approach based on freeing as much memory as possible before suspend.
Now, we are freeing only as much memory as necessary, which is at the
other end of the scale, so to speak. There are a whole lot of
possibilities in between, and the question is which one is best.
Frankly, I'm afraid the answer is very system-dependent.

If you want a quick solution, you can get back to the previous behavior by
commenting out the definition of FAST_FREE in kernel/power/power.h.

Alternatively, you can increase the value of PAGES_FOR_IO, defined
in include/linux/suspend.h. To see any effect, you'll probably have to
increase it by tens of thousands, but please note the box may be unable
to suspend if it's too large (if you try this anyway, please let me
know what number seems to work best for you).
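To see why a bigger PAGES_FOR_IO shrinks the image, here is a very simplified model; it only loosely mirrors the real constraint in the kernel and is offered as a sketch, not as the actual swsusp arithmetic:

```c
#include <assert.h>

/* Very simplified model: the image plus the I/O reserve must fit in the
 * memory left free alongside the pages being saved, so with `total`
 * pages of RAM the image can use at most (total - reserve) / 2 pages.
 * Raising the reserve therefore forces a smaller image. */
static unsigned long max_image_pages(unsigned long total, unsigned long reserve)
{
    if (reserve >= total)
        return 0;                    /* no room for an image at all */
    return (total - reserve) / 2;
}
```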

Also, I can create a patch to improve this a bit, if you promise to help
test/debug it. ;-)

Greetings,
Rafael


--
Beer is proof that God loves us and wants us to be happy - Benjamin Franklin


Pavel Machek

Dec 5, 2005, 6:40:17 PM
Hi!

> > > > * compress the image. Needs to be done in userspace, so it needs
> > > > uswsusp to be merged, first. Patches for that are available. Should
> > > > speed it up about twice.
> > >
> > > That's not true at all. You have cryptoapi in kernel space and can
> > > easily use it - it's very similar code to what you already have for
> > > encryption. You won't get double the speed with the deflate
> > > compressor - more like 2 or 3MB/s :(. Suspend2 gets double the speed
> > > because we use lzf, which is a logically distinct addition
> > > (implemented now as another cryptoapi plugin).
> >
> > Well, 3MB/sec improvement will save him seconds on 20, or something
> > like that, so I guess LZF *is* a way to go, and I'd like to keep that
> > out of kernel. And I will not accept compression into mainline swsusp;
> > did that experiment with encryption once already, and I did not like
> > the result much.
>
> No - I didn't mean a 3MB/s improvement. I meant that you'll get about
> 3MB/s throughput. It's _very_ slow. Of course having said that, I don't
> recall now what machine I saw that on. It may well be my 933 Celeron
> (Omnibook XE3).

Ah, okay -- that means the deflate compressor is pretty much useless.

> > If goal is "make it work with least effort", answer is of course
> > suspend2; but I need someone to help me doing it right.
>
> How do you think suspend2 does it wrong? Is it just that you think that
> everything belongs in userspace, or is there more to it?

Everything belongs in userspace... that makes it "wrong
enough". Userland and kernel programming are quite different, so any
improvements to suspend2 will be wasted, long-term. You'll make users
happy for now, but it means u-swsusp gets fewer users and fewer
developers, making "doing it right" slightly harder...
Pavel
--
Thanks, Sharp!

Pavel Machek

Dec 5, 2005, 7:00:15 PM
Hi!

> > > Now that I'm running 2.6.15-rc3-mm1, the page-in problem seems to be
> > > largely gone; I don't notice a significant lagginess after resuming from
> > > swsusp.
> > >
> > > But the suspend process is *slow*. It takes a good 20 or 30 seconds to
> > > write out the image, which makes the overall suspend process take close
> > > to a minute; it's writing about 400 MB, and my disk seems to only be
> > > good for about 18 MB/sec according to hdparm -t.
> >
> > Lets say 20 seconds suspend, plus 20 seconds resume, and no time
> > needed to switch the desktops. So it is ~40 seconds total, again ;-).
>
> I think there's no point in doing such calculations. In fact, above certain
> critical RAM size (call it X), the more RAM in the box, the _worse_ it gets when
> we try to free as little memory as possible (let alone trying to save _all_
> of it). The only question is how great is X.

Agreed. And X depends on workload.

Several approaches make sense.

(Y is the size of the image. Obviously Y < half of RAM and Y > some
minimum amount of kernel data. My approach is Y as low as possible,
your approach is Y as high as possible.)

1) Try to make Y as large as possible, but 500MB max. Common user
workloads fit into 500MB, so the user should get a responsive system
after resume, but we'll not write excessive amounts of data.
2) Only free memory that was not used in the last 10 minutes. That
should keep the system responsive enough after resume.

> > Should be easy, mergeable and possibly very effective. Relevant code
> > is in kernel/power/disk.c.
>
> For this purpose we'll need to tamper with mm, I think, and that won't be
> easy.
>
> OTOH, we can get similar result by just making the kernel free some
> more memory _after_ we are sure we have enough memory to suspend.
> IOW, after the code that's currently in swsusp_shrink_memory() has finished,
> we can try to free some "extra" memory to improve performance, if
> needed. The question is how much "extra" memory should be freed and
> I'm afraid it will have to be tuned on the per-system, or at least
> per-RAM-size, basis.

I'd prefer not to have extra tunables. "Write only 500MB" will work
okay for common desktop users -- as long as the common desktop fits
into 500MB, that is. "Free what was not used in the last 10 minutes"
should work okay for everyone, but may be slightly harder to implement.
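The 10-minute policy could be prototyped along these lines. Everything here is hypothetical: the real kernel tracks page recency via the LRU lists, not per-page wall-clock timestamps, and the struct is invented for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prototype of "free what was not used in the last 10
 * minutes": keep a page in the image only if it was touched recently.
 * The struct and its timestamp field are invented for illustration. */
#define KEEP_WINDOW_SEC (10 * 60)

struct fake_page {
    long last_used;   /* seconds-since-boot of the last access */
    int  keep;        /* 1 = keep in image, 0 = free before suspend */
};

static size_t mark_pages(struct fake_page *pages, size_t n, long now)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        pages[i].keep = (now - pages[i].last_used) <= KEEP_WINDOW_SEC;
        kept += pages[i].keep;
    }
    return kept;   /* number of pages that stay in the image */
}
```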
Pavel
--
Thanks, Sharp!

Pavel Machek

Dec 5, 2005, 7:10:08 PM
Hi!

> > Yes, it is not completely fair. But as I started to use X32 with good
> > battery... well I'm not really using swsusp any more.
>
> Unfortunately I can't make my box suspend to RAM ... :-(

Yes, debugging suspend-to-RAM is hard, but it is not impossible. Try
with a minimal config (noapic, 32-bit kernel, ...), then add
features. Hopefully the minimal kernel will work...

> > > > * compress the image. Needs to be done in userspace, so it needs
> > > > uswsusp to be merged, first. Patches for that are available. Should
> > > > speed it up about twice.
>
> Frankly, I would think that compression could be done in the kernel.

Unfortunately cryptoapi only supports gzip compression, and that's
useless for swsusp (it slows it down about 10 times). And adding LZW to
the kernel just for swsusp is the wrong thing to do.

> > If goal is "make it work with least effort", answer is of course
> > suspend2; but I need someone to help me doing it right.
>
> Well, in the Andy's case this may or may not help. Actually I'd like him to try
> and say what's the result, but only if he's so kind, has some free time
> to was^H^H^Hdo this, etc. ;-)

~~~~~~~~~~~
I think I'm missing something here.
Pavel
--
Thanks, Sharp!

Andy Isaacson

Dec 5, 2005, 7:20:06 PM
On Tue, Dec 06, 2005 at 12:05:04AM +0100, Rafael J. Wysocki wrote:
> On Monday, 5 December 2005 09:19, Andy Isaacson wrote:
> > On recent kernels such as 2.6.14-rc2-mm1, a swsusp of my laptop (1.25
> > GB, P4M 1.4 GHz) was a pretty fast process; freeing memory took about 3
> > seconds or less,
>
> It took much more time on my box, but I won't discuss with your
> experience. ;-)

I may be misremembering; it didn't seem important at the time. Less
than 10 seconds, anyway.

> > Certainly there's a tradeoff to be made, and I'm glad to lose the slow
> > re-paging after resume, but I'm hoping that some kind of improvement can
> > be made in the suspend/resume time.
>
> Yes, there is a tradeoff. Till now, we have used the simplistic approach
> based on freeing as much memory as possible before suspend. Now, we
> are freeing only as much memory as necessary, which is on the other
> end of the scale, so to speak. There are a whole lot of possibilities in
> between, and there's a question which one is the best. Frankly, I'm afraid
> the answer is very system-dependent.

If you wanted to compute "what's the absolute perfect 99.9999th-
percentile amount to free", yes, that would be impossibly
system-dependent.

But some kind of rule of thumb should get good results in most cases,
and it should be easy enough to add a knob for people who have
particular requirements.

> If you want a quick solution, you can get back to the previous behavior by
> commenting out the definition of FAST_FREE in kernel/power/power.h.

That's boring. :) The current behavior isn't bad enough to force me
back.

> Alternatively, you can increase the value of PAGES_FOR_IO, defined
> in include/linux/suspend.h. To see any effect, you'll probably have to
> increase it by tens of thousands, but please note the box may be unable
> to suspend if it's too great (if you try this anyway, please let me know what
> number seems to be the best to you).
>
> Also, I can create a patch to improve this a bit, if you promise to help
> test/debug it. ;-)

I'll play with this a bit tonight but I'd love to see a patch that makes
it a tunable. Rebooting my laptop is sloooow and annoying (due to slow
startup scripts and losing all my state), but trying various suspend
settings sounds like a fun experiment.

-andy

Pavel Machek

Dec 5, 2005, 8:00:13 PM
Hi!

> > Alternatively, you can increase the value of PAGES_FOR_IO, defined
> > in include/linux/suspend.h. To see any effect, you'll probably have to
> > increase it by tens of thousands, but please note the box may be unable
> > to suspend if it's too great (if you try this anyway, please let me know what
> > number seems to be the best to you).
> >
> > Also, I can create a patch to improve this a bit, if you promise to help
> > test/debug it. ;-)
>
> I'll play with this a bit tonight but I'd love to see a patch that makes
> it a tunable. Rebooting my laptop is sloooow and annoying (due to slow
> startup scripts and losing all my state), but trying various suspend
> settings sounds like a fun experiment.

init=/bin/bash, and you can get rid of startup scripts ;-).
Pavel
--
Thanks, Sharp!

Nigel Cunningham

Dec 5, 2005, 8:20:12 PM
Hi.

I agree that whether it's a good idea varies according to individual
tastes and usage. That's why I've made it configurable. The other point
to remember is that suspend2's I/O performance is much better. My
desktop here at work, for example, writes the image at 72MB/s and reads
it back at 116MB/s. (3GHz P4 with a Maxtor 6Y120L0). Writing 1GB at
these rates is not a problem.

> Second, IMHO, some things you are doing in suspend2, like image encryption,
> or accessing ordinary files, should not be implemented in the kernel.

Image encryption is just done using cryptoapi - I just expose the
parameters and optionally save them in the image; there's no nous in
suspend2 regarding encryption beyond that.

Regarding accessing ordinary files, it's really just a variation on the
swapwriter in that we bmap the storage and then use the blocks we're
given. There were two reasons for adding this - first removing the
dependency on available swapspace, and secondly working towards better
support for embedded (write the image to a file and include the file in
place of a ramdisk image). The second reason might sound like fluff, but
I can assure you as an embedded developer myself that embedded people
are really interested in seeing if this technique will be a viable way
of speeding up boot times.

> That said, I think at least some of the functionalities you have already
> implemented in suspend2 are needed and generally can be shared between
> your code and swsusp. I've been going to look for such possibilities for some
> time, but unfortunately, in its downloadable form, your patch is
> quite difficult to follow, so if you have a version that is organized in a more
> functionality-oriented way, could you please point me to it?

I'm working toward getting a git tree prepared, but it's taking quite a
while because I've been doing cleanups and so on at the same time. I
believe I've finally run out of new functionality to add (thankfully!),
so I'm now concentrating on bug blatting and on finishing off the git
tree preparation. I'm doing this all mostly in spare time though, so I'm
sorry but it's not ready yet. I could give you the collection of patches
as I have it at the moment, but it's being modified heavily pretty
constantly.

Regards,

Nigel

Nigel Cunningham

Dec 5, 2005, 8:40:05 PM
Hi.

On Tue, 2005-12-06 at 09:34, Pavel Machek wrote:
> > > If goal is "make it work with least effort", answer is of course
> > > suspend2; but I need someone to help me doing it right.
> >
> > How do you think suspend2 does it wrong? Is it just that you think that
> > everything belongs in userspace, or is there more to it?
>
> Everything belongs in userspace... that makes it "wrong
> enough". Userland and kernel programming is quite different, so any
> improvements to suspend2 will be wasted, long-term. You'll make users
> happy for now, but it means u-swsusp gets fewer users and fewer
> developers, making "doing it right" slightly harder...

Ok. I guess I need help then in seeing why everything belongs in
userspace. Actually, let's revise that for a start - I know you don't
really mean everything, because even you still do the atomic copy in
kernel space... or are you planning on changing that too? :)

I'm not unwilling to be convinced - I just don't see why, with such a
lowlevel operation as suspending to disk, userspace is the place to put
everything. The preference for userspace seems to me to be just that - a
preference.

Regarding improvements to suspend2 being wasted long term, I actually
think that I could port at least part of it to userspace without too
much effort at all. My main concern would be exporting the information
and interfaces needed in a way that isn't ugly, is reliable and doesn't
open security holes. I'm not at all convinced that kmem meets those
criteria. But if you can show me a better way, I'll happily come on
board.

Regards,

Nigel

Pavel Machek

Dec 5, 2005, 8:40:09 PM
Hi!

> > First, I don't think that saving a full image of memory is generally a good
> > idea. It is - for systems with relatively small RAM, but for systems with
> > more than, say, 512 MB that's questionable. Of course that depends a lot
> > on the usage patterns of particular system, but having 768 MB of RAM
> > in my box I wouldn't like it to save more than a half of it during suspend,
> > for performance reasons.
>
> I agree that whether it's a good idea varies according to individual
> tastes and usage. That's why I've made it configurable. The other point
> to remember is that suspend2's I/O performance is much better. My
> desktop here at work, for example, writes the image at 72MB/s and reads
> it back at 116MB/s. (3GHz P4 with a Maxtor 6Y120L0). Writing 1GB at
> these rates is not a problem.

Andy reported 20MB/sec on hdparm. I do not think it is possible to
write faster than that. And that seems about ok for notebooks, X32
(pretty new) has:

root@amd:~# hdparm -t /dev/hda

/dev/hda:
Timing buffered disk reads: 108 MB in 3.01 seconds = 35.85 MB/sec

> > Second, IMHO, some things you are doing in suspend2, like image encryption,
> > or accessing ordinary files, should not be implemented in the kernel.
>
> Image encryption is just done using cryptoapi - I just expose the
> parameters and optionally save them in the image; there's no nous in
> suspend2 regarding encryption beyond that.

Unfortunately all these "small things" add up.

> Regarding accessing ordinary files, it's really just a variation on the
> swapwriter in that we bmap the storage and then use the blocks we're
> given. There were two reasons for adding this - first removing the
> dependency on available swapspace, and secondly working towards better
> support for embedded (write the image to a file and include the file in
> place of a ramdisk image). The second reason might sound like fluff, but
> I can assure you as an embedded developer myself that embedded people
> are really interested in seeing if this technique will be a viable way
> of speeding up boot times.

Interesting use, but for an embedded app, they can just reserve a
partition as well. [I have seen some patches doing that.]
Pavel
--
Thanks, Sharp!

Nigel Cunningham

Dec 5, 2005, 8:50:06 PM
Hi again.

On Tue, 2005-12-06 at 08:28, Rafael J. Wysocki wrote:

> Hi,
>
> On Monday, 5 December 2005 14:58, Nigel Cunningham wrote:
> > On Mon, 2005-12-05 at 22:17, Pavel Machek wrote:
> > > > On recent kernels such as 2.6.14-rc2-mm1, a swsusp of my laptop (1.25
> > > > GB, P4M 1.4 GHz) was a pretty fast process; freeing memory took about 3
> > > > seconds or less, and writing out the swap image took less than 5
> > > > seconds, so within 15 seconds of running my suspend script power was
> > > > off.
> > >
> > > So suspend took 15 second, and boot another 5 to read the image + 20
> > > first time desktops are switched. ... ~40 second total.
> >
> > Plus what is mentioned in the next paragraph.
>
> Indeed. Yet, the point has been made and backed up with some numbers:
> There's at least one swsusp user (Andy) who would apparently _prefer_ if more
> memory were freed during suspend. The reason is the amount of
> RAM in Andy's box.

Perhaps I wasn't clear enough. I was arguing that if you get your prompt
back in 40s, but the computer is still thrashing for the next minute or
ten, you haven't really finished resuming yet.

Regards,

Nigel

Andy Isaacson

Dec 5, 2005, 8:50:07 PM
On Tue, Dec 06, 2005 at 02:37:59AM +0100, Pavel Machek wrote:
> > desktop here at work, for example, writes the image at 72MB/s and reads
> > it back at 116MB/s. (3GHz P4 with a Maxtor 6Y120L0). Writing 1GB at
> > these rates is not a problem.

Hmm, I only wish...

> Andy reported 20MB/sec on hdparm. I do not think it is possible to
> write faster than that. And that seems about ok for notebooks, X32
> (pretty new) has:
>
> root@amd:~# hdparm -t /dev/hda

You named your X32 "amd"? How ... confusing ... (assuming it, like all
other Thinkpad X series I know, has a Pentium M.)

> /dev/hda:
> Timing buffered disk reads: 108 MB in 3.01 seconds = 35.85 MB/sec

That's quite a bit better than mine, and I am pretty sure I am the same
vintage or newer (purchased this summer), but I'm getting barely half
that speed:
Timing buffered disk reads: 58 MB in 3.10 seconds = 18.70 MB/sec

How can I find out what disk is in this beast and try to track down some
of my missing performance? It looks like I have the right DMA settings
and whatnot looking at hdparm(1).

-andy

Pavel Machek

Dec 5, 2005, 9:00:14 PM
Hi!

> > > > If goal is "make it work with least effort", answer is of course
> > > > suspend2; but I need someone to help me doing it right.
> > >
> > > How do you think suspend2 does it wrong? Is it just that you think that
> > > everything belongs in userspace, or is there more to it?
> >
> > Everything belongs in userspace... that makes it "wrong
> > enough". Userland and kernel programming is quite different, so any
> > improvements to suspend2 will be wasted, long-term. You'll make users
> > happy for now, but it means u-swsusp gets fewer users and fewer
> > developers, making "doing it right" slightly harder...
>
> Ok. I guess I need help then in seeing why everything belongs in
> userspace. Actually, let's revise that for a start - I know you don't
> really mean everything, because even you still do the atomic copy in
> kernel space... or are you planning on changing that too? :)

Unfortunately the atomic copy cannot be moved to userspace, nor can the
driver handling. [If I knew how to do it completely from
userspace, I definitely would, BTW.]

> I'm not unwilling to be convinced - I just don't see why, with such a
> lowlevel operation as suspending to disk, userspace is the place to put
> everything. The preference for userspace seems to me to be just that - a
> preference.

It's actually a pretty hard rule. It should be in userspace if it
reasonably can be. It should be in userspace if it contains policy.

suspend2 contains bits of policy -- esc to cancel, compression
options, encryption options. It is not unreasonable that someone would
want any key to cancel, or some other compression, or something like
that. You actually put the progress-bar code into userspace -- a good
step. So you have all the complexity of userspace running during
suspend/resume (it is not that bad), but are not getting the full
benefits -- move more code to userspace.

Practical benefits of u-swsusp should be:

* no need to split userland parts in small pieces.

* it is easier to develop in userspace.

* access to libraries -- adding LZW or whatever is not a problem. No
need to convert those libraries to kernel coding style.

* it is possible to have multiple versions. In the kernel, only one
version can exist, but in userland, it is okay to have a few
implementations.

> Regarding improvements to suspend2 being wasted long term, I actually
> think that I could port at least part of it to userspace without too
> much effort at all. My main concern would be exporting the information
> and interfaces needed in a way that isn't ugly, is reliable and doesn't
> open security holes. I'm not at all convinced that kmem meets those
> criteria. But if you can show me a better way, I'll happily come on
> board.

I have thought about this. We already mark pages for suspend with
PageNosave | PageNosaveFree; we could only allow access to pages
marked like this by /dev/suspend, or something like that.

Rafael thinks we should provide nicer interface to userspace; that's
okay with me as long as it is not slow or too much of code.

Pavel
--
Thanks, Sharp!

Pavel Machek

Dec 5, 2005, 9:00:18 PM
Hi!

> > Andy reported 20MB/sec on hdparm. I do not think it is possible to
> > write faster than that. And that seems about ok for notebooks, X32
> > (pretty new) has:
> >
> > root@amd:~# hdparm -t /dev/hda
>
> You named your X32 "amd"? How ... confusing ... (assuming it, like all
> other Thinkpad X series I know, has a Pentium M.)

Historical reasons, sorry. I had an athlon64 machine as my main
workstation, then just cp -a'ed it to the new system. It is also called
Arima when connected to the network, while it is an IBM system...

> > /dev/hda:
> > Timing buffered disk reads: 108 MB in 3.01 seconds = 35.85 MB/sec
>
> That's quite a bit better than mine, and I am pretty sure I am the same
> vintage or newer (purchased this summer), but I'm getting barely half
> that speed:
> Timing buffered disk reads: 58 MB in 3.10 seconds = 18.70 MB/sec

Below are data from my machine... but that should be moved to
linux-ide or something. This thinkpad is from this summer, too, BTW.

Pavel

root@amd:~# hdparm /dev/hda

/dev/hda:
multcount = 0 (off)
IO_support = 1 (32-bit)
unmaskirq = 0 (off)
using_dma = 1 (on)
keepsettings = 0 (off)
readonly = 0 (off)
readahead = 256 (on)
geometry = 65535/16/63, sectors = 78140160, start = 0
root@amd:~# hdparm -i /dev/hda

/dev/hda:

Model=HTS541040G9AT00, FwRev=MB2IA5BJ, SerialNo=MPB2L0X2GLMG5M
Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
BuffType=DualPortCache, BuffSize=7539kB, MaxMultSect=16, MultSect=off
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=78140160
IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
AdvancedPM=yes: mode=0x80 (128) WriteCache=enabled
Drive conforms to: ATA/ATAPI-6 T13 1410D revision 3a:

* signifies the current active mode

Pavel

--
Thanks, Sharp!

Nigel Cunningham

Dec 5, 2005, 9:10:05 PM
Hi Andy.

On Tue, 2005-12-06 at 11:47, Andy Isaacson wrote:
> On Tue, Dec 06, 2005 at 02:37:59AM +0100, Pavel Machek wrote:
> > > desktop here at work, for example, writes the image at 72MB/s and reads
> > > it back at 116MB/s. (3GHz P4 with a Maxtor 6Y120L0). Writing 1GB at
> > > these rates is not a problem.
>
> Hmm, I only wish...
>
> > Andy reported 20MB/sec on hdparm. I do not think it is possible to
> > write faster than that. And that seems about ok for notebooks, X32
> > (pretty new) has:
> >
> > root@amd:~# hdparm -t /dev/hda
>
> You named your X32 "amd"? How ... confusing ... (assuming it, like all
> other Thinkpad X series I know, has a Pentium M.)
>
> > /dev/hda:
> > Timing buffered disk reads: 108 MB in 3.01 seconds = 35.85 MB/sec
>
> That's quite a bit better than mine, and I am pretty sure I am the same
> vintage or newer (purchased this summer), but I'm getting barely half
> that speed:
> Timing buffered disk reads: 58 MB in 3.10 seconds = 18.70 MB/sec
>
> How can I find out what disk is in this beast and try to track down some
> of my missing performance? It looks like I have the right DMA settings
> and whatnot looking at hdparm(1).

What RPM does the drive spin at? Use hdparm -I to get the model
number, then Google for the specs. If it's like most laptop HDDs, it
probably only spins at 4200RPM. My original drive in my Omnibook was
like that - 4200RPM, ATA66. The best it would do (according to hdparm
-t) was about 17 or 18MB/s raw. Pavel's better numbers are the same as
what I get in my laptop now, with a 7200RPM drive. Here ATA66 appears to
have become the limiting factor, so that I get about 35 or 36MB/s too.
The desktop drive I mentioned above is ATA133, and does about 56MB/s
raw. The higher numbers above (72/116) come from having a fast cpu
(de)compressing.

In my case, because my hard drive is fast, I don't gain much from
compressing the image. Writing is actually slower while compressing, and
I get a bit of a gain while uncompressing. In your case though, since
your hard drive is slower, you'll benefit more from compressing the
image, getting closer to the doubling that we've spoken of before. (The
doubling comes from the compression ratio that LZF achieves generally
being around 50%.)

Hope this helps.

Nigel.

Andy Isaacson

Dec 5, 2005, 9:10:13 PM
On Tue, Dec 06, 2005 at 11:36:13AM +1000, Nigel Cunningham wrote:
> > > > > On recent kernels such as 2.6.14-rc2-mm1, a swsusp of my
> > > > > laptop (1.25 GB, P4M 1.4 GHz) was a pretty fast process;
> > > > > freeing memory took about 3 seconds or less, and writing out
> > > > > the swap image took less than 5 seconds, so within 15 seconds
> > > > > of running my suspend script power was off.
> > > >
> > > > So suspend took 15 second, and boot another 5 to read the image + 20
> > > > first time desktops are switched. ... ~40 second total.
> > >
> > > Plus what is mentioned in the next paragraph.
> >
> > Indeed. Yet, the point has been made and backed up with some numbers:
> > There's at least one swsusp user (Andy) who would apparently
> > _prefer_ if more memory were freed during suspend. The reason is
> > the amount of RAM in Andy's box.
>
> Perhaps I wasn't clear enough. I was arguing that if you get your prompt
> back in 40s, but the computer is still thrashing for the next minute or
> ten, you haven't really finished resuming yet.

I started this thread to complain about the increase in time from "zzz"
to "power's off" (which has increased approximately 2x to 3x in
2.6.15-rc3-mm1), but it's also relevant to consider restart time and
performance (which was the genesis of the changes in 15-rc3-mm1 IIUC).

Previous kernels (2.6.14-rc2-mm1) got me back to the prompt faster, but
at the cost of leaving most of a GB of memory unused (and forcing me to
manually page stuff in as I used it over the next half hour).

Newer kernels write and read a bigger image, which makes the prompt show
up somewhat later, but gives the benefit of putting me back in
approximately the same place I left off with regards to working set.

I would like the best of both worlds - I want my suspend to go faster
(so I want a smaller image), and I also want my working set paged back
in after resume.

I'm assuming that the difference is that with Rafael's patches, clean
pages that would have been evicted in the "freeing pages..." step are
now being written out to the swsusp image. If so, this is a waste - no
point in having the data on disk twice. (It would be nice to confirm
this suspicion.)

Could we rework it to avoid writing clean pages out to the swsusp image,
but keep a list of those pages and read them back in *after* having
resumed? Maybe do the /dev/initrd ('less +/once Documentation/initrd.txt'
if you're not familiar with it) trick to make the list of pages available
to a userland helper.

Someone suggested the 'cat `grep / /proc/*/maps`' trick. This kills the
working set calculations that the kernel has so painstakingly built up,
reading in all kinds of pages that were flushed with good reason, and
also fails to get my Mercurial .d files back into cache, since they are
not mapped by any long-running process.

-andy

Nigel Cunningham

Dec 5, 2005, 9:20:08 PM
Hi.

On Tue, 2005-12-06 at 11:37, Pavel Machek wrote:
> Hi!
>
> > > First, I don't think that saving a full image of memory is generally a good
> > > idea. It is - for systems with relatively small RAM, but for systems with
> > > more than, say, 512 MB that's questionable. Of course that depends a lot
> > > on the usage patterns of particular system, but having 768 MB of RAM
> > > in my box I wouldn't like it to save more than a half of it during suspend,
> > > for performance reasons.
> >
> > I agree that whether it's a good idea varies according to individual
> > tastes and usage. That's why I've made it configurable. The other point
> > to remember is that suspend2's I/O performance is much better. My
> > desktop here at work, for example, writes the image at 72MB/s and reads
> > it back at 116MB/s. (3GHz P4 with a Maxtor 6Y120L0). Writing 1GB at
> > these rates is not a problem.
>
> Andy reported 20MB/sec on hdparm. I do not think it is possible to
> write faster than that. And that seems about ok for notebooks, X32
> (pretty new) has:

Depending upon what speed his CPU is, he should be able to achieve close
to 40MB/s with LZF compression (assuming 50% compression and a CPU fast
enough that the disk continues to be the bottleneck).
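That estimate can be sanity-checked with a little shell arithmetic (numbers taken from this thread; the 50% ratio is only the usual LZF assumption):

```shell
# Effective write rate when the disk stays the bottleneck is
# raw_rate / compression_ratio. Work in tenths of MB/s so plain
# shell integer arithmetic is enough.
raw=187        # Andy's drive: 18.7 MB/s raw
ratio_pct=50   # LZF typically shrinks the image to ~50%
effective=$(( raw * 100 / ratio_pct ))
echo "effective: $(( effective / 10 )).$(( effective % 10 )) MB/s"
# → effective: 37.4 MB/s
```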

> root@amd:~# hdparm -t /dev/hda
>
> /dev/hda:
> Timing buffered disk reads: 108 MB in 3.01 seconds = 35.85 MB/sec
>
> > > Second, IMHO, some things you are doing in suspend2, like image encryption,
> > > or accessing ordinary files, should not be implemented in the kernel.
> >
> > Image encryption is just done using cryptoapi - I just expose the
> > parameters and optionally save them in the image; there's no nous in
> > suspend2 regarding encryption beyond that.
>
> Unfortunately all these "small things" add up.

But so does doing it from userspace - you then have to make the pages
available to the userspace program, implement encryption there, provide
safety nets in case userspace dies unexpectedly and so on. There is a
cost to encryption that occurs regardless of where we do it.

> > Regarding accessing ordinary files, it's really just a variation on the
> > swapwriter in that we bmap the storage and then use the blocks we're
> > given. There were two reasons for adding this - first removing the
> > dependency on available swapspace, and secondly working towards better
> > support for embedded (write the image to a file and include the file in
> > place of a ramdisk image). The second reason might sound like fluff, but
> > I can assure you as an embedded developer myself that embedded people
> > are really interested in seeing if this technique will be a viable way
> > of speeding up boot times.
>
> Interesting use, but for an embedded app, they can just reserve a
> partition as well. [I have seen some patches doing that.]

For swap?

Regards,

Nigel

Nigel Cunningham

Dec 5, 2005, 9:30:13 PM
Hi.

On Tue, 2005-12-06 at 12:06, Andy Isaacson wrote:
> On Tue, Dec 06, 2005 at 11:36:13AM +1000, Nigel Cunningham wrote:
> I'm assuming that the difference is that with Rafael's patches, clean
> pages that would have been evicted in the "freeing pages..." step are
> now being written out to the swsusp image. If so, this is a waste - no
> point in having the data on disk twice. (It would be nice to confirm
> this suspicion.)

Forgot to mention - that's true.

Regards,

Nigel

Nigel Cunningham

Dec 5, 2005, 9:30:14 PM
Hi.

On Tue, 2005-12-06 at 12:06, Andy Isaacson wrote:
> Could we rework it to avoid writing clean pages out to the swsusp image,
> but keep a list of those pages and read them back in *after* having
> resumed? Maybe do the /dev/initrd ('less +/once Documentation/initrd.txt'
> if you're not familiar with it) trick to make the list of pages available
> to a userland helper.

The problem is that once you let userspace run, you have absolutely no
control over what pages are read from or written to, and if a userspace
app assumes that data is there in a page when it isn't, you have a
recipe for an oops at best, and possibly for on-disk corruption. Pages
that haven't been read yet could conceivably be set up to be treated as
faults, but then we're making things a lot more complicated than Pavel
or I want to do. That's why a good portion of the improvements in
Suspend2 have concentrated on making the process work faster - doing
more than one I/O at a time results in a good performance improvement.

Regards,

Nigel

Mark Lord

Dec 5, 2005, 11:00:11 PM
Andy Isaacson wrote:
>
>>/dev/hda:
>> Timing buffered disk reads: 108 MB in 3.01 seconds = 35.85 MB/sec
>
> That's quite a bit better than mine, and I am pretty sure I am the same
> vintage or newer (purchased this summer), but I'm getting barely half
> that speed:
> Timing buffered disk reads: 58 MB in 3.10 seconds = 18.70 MB/sec
>
> How can I find out what disk is in this beast and try to track down some
> of my missing performance?

hdparm -I /dev/hda (or cat /proc/ide/hda/model)

The FUJITSU MHV2100AH in my Dell i9300 gives 57.5 MB/sec with hdparm.
Note that hdparm does not read in the most efficient manner
(so the drive is likely even faster than that),
and writing to drives is normally slower than reading.

Cheers

Andy Isaacson

Dec 6, 2005, 1:30:12 AM
On Tue, Dec 06, 2005 at 02:56:16AM +0100, Pavel Machek wrote:
> Below are data from my machine... but that should be moved to
> linux-ide or something. This thinkpad is from this summer, too, BTW.
>
> /dev/hda:
> Model=HTS541040G9AT00, FwRev=MB2IA5BJ, SerialNo=MPB2L0X2GLMG5M

Alas, the reason for my poor disk performance becomes clear - it's a
1.8" drive! Gah, if I'd known that I wouldn't have bought this laptop
(though I *do* like it in other regards).

/dev/hda:

Model=HTC426040G9AT00, FwRev=00P4A0B4, SerialNo=121611
Config={ Fixed }
RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
BuffType=unknown, BuffSize=0kB, MaxMultSect=16, MultSect=off
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=78140160
IORDY=yes, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}


PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
AdvancedPM=yes: mode=0x80 (128) WriteCache=enabled
Drive conforms to: ATA/ATAPI-6 T13 1410D revision 3a:

* signifies the current active mode

-andy

Pavel Machek

Dec 6, 2005, 7:00:11 AM
Hi!

> > Below are data from my machine... but that should be moved to
> > linux-ide or something. This thinkpad is from this summer, too, BTW.
> >
> > /dev/hda:
> > Model=HTS541040G9AT00, FwRev=MB2IA5BJ, SerialNo=MPB2L0X2GLMG5M
>
> Alas, the reason for my poor disk performance becomes clear - it's a
> 1.8" drive! Gah, if I'd known that I wouldn't have bought this laptop
> (though I *do* like it in other regards).

Don't go near the Sharp Zaurus then. It has a compact-flash-sized
harddrive, with 2MB/sec max. I still like that machine, though.
Pavel
--
Thanks, Sharp!

Pavel Machek

Dec 6, 2005, 7:20:16 AM
Hi!

> Newer kernels write and read a bigger image, which makes the prompt show
> up somewhat later, but gives the benefit of putting me back in
> approximately the same place I left off with regards to working set.
>
> I would like the best of both worlds - I want my suspend to go faster
> (so I want a smaller image), and I also want my working set paged back
> in after resume.
>
> I'm assuming that the difference is that with Rafael's patches, clean
> pages that would have been evicted in the "freeing pages..." step are
> now being written out to the swsusp image. If so, this is a waste - no
> point in having the data on disk twice. (It would be nice to confirm
> this suspicion.)

Confirmed. But you are wrong; it is not a waste. The pages are nicely
linear in the suspend image, while they would be all over the disk
otherwise. There can easily be a factor-of-20 difference between a
linear read and a random read.

> Could we rework it to avoid writing clean pages out to the swsusp image,
> but keep a list of those pages and read them back in *after* having
> resumed? Maybe do the /dev/initrd ('less +/once Documentation/initrd.txt'
> if you're not familiar with it) trick to make the list of pages available
> to a userland helper.

I did not understand this one.

> Someone suggested the 'cat `grep / /proc/*/maps`' trick. This kills the
> working set calculations that the kernel has so painstakingly built up,
> reading in all kinds of pages that were flushed with good reason, and
> also fails to get my Mercurial .d files back into cache, since they are
> not mapped by any long-running process.

Yes, that was a very rough approximation.

Anyway, try limiting the size of the image to ~500MB first. That should
solve your problem with very little work.
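Assuming the -mm kernel in question exposes an image-size knob at /sys/power/image_size (that path is a guess here; check Documentation/power/ in your tree for the actual name), the cap is a one-liner:

```shell
# Cap the suspend image at ~500MB (value in bytes), then suspend.
# /sys/power/image_size is assumed to exist in this kernel.
echo $(( 500 * 1024 * 1024 )) > /sys/power/image_size
echo disk > /sys/power/state
```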
Pavel
--
Thanks, Sharp!

Pavel Machek

Dec 6, 2005, 7:20:16 AM
Hi!

> > root@amd:~# hdparm -t /dev/hda
> >
> > /dev/hda:
> > Timing buffered disk reads: 108 MB in 3.01 seconds = 35.85 MB/sec
> >
> > > > Second, IMHO, some things you are doing in suspend2, like image encryption,
> > > > or accessing ordinary files, should not be implemented in the kernel.
> > >
> > > Image encryption is just done using cryptoapi - I just expose the
> > > parameters and optionally save them in the image; there's no nous in
> > > suspend2 regarding encryption beyond that.
> >
> > Unfortunately all these "small things" add up.
>
> But so does doing it from userspace - you then have to make the pages
> available to the userspace program, implement encryption there, provide
> safety nets in case userspace dies unexpectedly and so on. There is a
> cost to encryption that occurs regardless of where we do it.

Well, doing it in userspace means that you only pay kernel complexity
once; and then you can get encryption, compression, suspend-to-file
for free. And amount of kernel changes is surprisingly small.

Userspace recovery is not a big problem, btw. First, userspace should
just work. It is doing suspend to disk, so it had better not fail.
Fortunately, during debugging I found out that being in userspace has
big advantages: you can still use the usual recovery techniques after a
segfault.

> > Interesting use, but for an embedded app, they can just reserve a
> > partition as well. [I have seen some patches doing that.]
>
> For swap?

Yes. And then add some hacks to swapoff as soon as image is restored.

Pavel
--
Thanks, Sharp!

Pavel Machek

Dec 6, 2005, 9:30:10 AM
Hi!

> Hi.
>
> On Tue, 2005-12-06 at 12:06, Andy Isaacson wrote:
> > Could we rework it to avoid writing clean pages out to the swsusp image,
> > but keep a list of those pages and read them back in *after* having
> > resumed? Maybe do the /dev/initrd ('less +/once Documentation/initrd.txt'
> > if you're not familiar with it) trick to make the list of pages available
> > to a userland helper.
>
> The problem is that once you let userspace run, you have absolutely no
> control over what pages are read from or written to, and if a userspace
> app assumes that data is there in a page when it isn't, you have a
> recipe for an oops at best, and possibly for on-disk
> corruption. Pages

No, that will not be a problem. You just resume the system as you do
now; most pages will not be there. *But the kernel knows they are not
there*, and will load them back on demand. It will be a normal userland
application doing the readback. There's absolutely no risk of
corruption.

Imagine something that saves list of needed pages before suspend, then
does something like

cat `cat /proc/[0-9]*/maps | grep / | sed 's:.* /:/:' | sort -u` > /dev/null

...it should work pretty well. And the worst thing it can do is send
your system thrashing.

Pavel
--
Thanks, Sharp!

Mark Lord

Dec 6, 2005, 10:10:17 AM
Mark Lord wrote:
>
> The FUJITSU MHV2100AH in my Dell i9300 gives 57.5 MB/sec with hdparm.

Whoah! keyboard out of control and bad eyesight. 37.5, not 57.5 !!

Andy Isaacson

Dec 6, 2005, 1:20:11 PM
On Tue, Dec 06, 2005 at 01:18:35PM +0100, Pavel Machek wrote:
> > I'm assuming that the difference is that with Rafael's patches, clean
> > pages that would have been evicted in the "freeing pages..." step are
> > now being written out to the swsusp image. If so, this is a waste - no
> > point in having the data on disk twice. (It would be nice to confirm
> > this suspicion.)
>
> Confirmed. But you are wrong; it is not a waste. The pages are nicely
> linear in the suspend image, while they would be all over the disk
> otherwise. There can easily be a factor-of-20 difference between a
> linear read and a random read.

Agreed, linear reads are obviously an enormous improvement over seeking
all over the disk. (Especially given my 15 ms seek latency.) It would
suck to have to do all those seeks synchronously (before allowing the
swsusp resume to complete). But see below for my suggested alternative.

> > Could we rework it to avoid writing clean pages out to the swsusp image,
> > but keep a list of those pages and read them back in *after* having
> > resumed? Maybe do the /dev/initrd ('less +/once Documentation/initrd.txt'
> > if you're not familiar with it) trick to make the list of pages available
> > to a userland helper.
>
> I did not understand this one.

I'm suggesting that rather than writing the clean pages out to the
image, simply make their metadata available to a post-resume userland
helper. Something like

% head -2 /dev/swsusp-helper
/bin/sh 105-115 192 199-259
/lib/libc-2.3.2.so 1-250

where the userland program is expected to use the list of page numbers
(and getpagesize(2)) to asynchronously page in the working set in an
ionice'd manner.

This doesn't get rid of the seeks, of course, but doing them post-resume
will improve interactive performance while avoiding the cost of gigantic
swsusp images.
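A userland consumer of that (hypothetical) format could be very small. The sketch below assumes the `<path> <page-ranges>` lines from the example above, hard-codes a 4096-byte page size instead of calling getpagesize(2), and uses plain dd reads to fault pages back into the cache; run it under nice/ionice after resume:

```shell
# Expand "<path> <page-ranges>..." lines and read each listed page,
# so the kernel pulls it back into the page cache.
page_in() {
    while read -r file ranges; do
        for r in $ranges; do
            start=${r%%-*}; end=${r##*-}   # a bare "192" expands to 192..192
            p=$start
            while [ "$p" -le "$end" ]; do
                dd if="$file" of=/dev/null bs=4096 skip="$p" count=1 \
                    2>/dev/null && echo "$file $p"
                p=$(( p + 1 ))
            done
        done
    done
}

# e.g.: page_in < /dev/swsusp-helper
```

The echoed lines are just progress output; dropping the echo leaves a silent prefetcher.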

> Anyway, try limiting the size of the image to ~500MB first. That should
> solve your problem with very little work.

This is obviously the right thing for my situation, and it's on my list.

-andy

Rafael J. Wysocki

Dec 6, 2005, 8:10:06 PM
Hi,

On Tuesday, 6 December 2005 19:15, Andy Isaacson wrote:
}-- snip --{


>
> I'm suggesting that rather than writing the clean pages out to the
> image, simply make their metadata available to a post-resume userland
> helper. Something like
>
> % head -2 /dev/swsusp-helper
> /bin/sh 105-115 192 199-259
> /lib/libc-2.3.2.so 1-250
>
> where the userland program is expected to use the list of page numbers
> (and getpagesize(2)) to asynchronously page in the working set in an
> ionice'd manner.

The helper is not necessary, I think.

What we can do is skip blank pages while writing the image and only use
placeholders for them in the metadata. Then we can make them blank again
when we load the image into memory.

Still, this would require some considerable changes in the swap-handling
part of swsusp and lots of testing.

Greetings,
Rafael


--
Beer is proof that God loves us and wants us to be happy - Benjamin Franklin

Pavel Machek

Dec 6, 2005, 8:20:08 PM
Hi!

> > I'm suggesting that rather than writing the clean pages out to the
> > image, simply make their metadata available to a post-resume userland
> > helper. Something like
> >
> > % head -2 /dev/swsusp-helper
> > /bin/sh 105-115 192 199-259
> > /lib/libc-2.3.2.so 1-250
> >
> > where the userland program is expected to use the list of page numbers
> > (and getpagesize(2)) to asynchronously page in the working set in an
> > ionice'd manner.
>
> The helper is not necessary, I think.

Actually, I like the helper. It is the safest solution, and a list of cached
pages in memory is going to be useful for other stuff, too.

Imagine:

cat /dev/give-me-list-of-pages-in-page-cache > /tmp/delme.suspend
echo disk > /sys/power/state
nice ( cat /tmp/delme.suspend | read-those-pages-back ) &

The result is quite obviously safe (unless you mess up locking while
dumping the pagecache), and it is going to be rather easy to test. Just
load the system as much as you can while doing

while true; do cat /dev/give-me-list-of-pages-in-page-cache > /dev/null; done

Still, limiting the image size to 500MB is probably the easier solution. I'm
looking forward to that patch.
Pavel
--
Thanks, Sharp!

Rafael J. Wysocki

Dec 7, 2005, 6:20:13 AM
Hi,

On Wednesday, 7 December 2005 02:10, Pavel Machek wrote:
> > > I'm suggesting that rather than writing the clean pages out to the
> > > image, simply make their metadata available to a post-resume userland
> > > helper. Something like
> > >
> > > % head -2 /dev/swsusp-helper
> > > /bin/sh 105-115 192 199-259
> > > /lib/libc-2.3.2.so 1-250
> > >
> > > where the userland program is expected to use the list of page numbers
> > > (and getpagesize(2)) to asynchronously page in the working set in an
> > > ionice'd manner.
> >
> > The helper is not necessary, I think.
>
> Actually, I like the helper. It is safest solution,

No, it's not.

Let me explain what I have in mind.

For starters, please observe that the addresses we use are page-aligned,
so the least significant bit is always zero. Thus it can be used as a marker.

Now, before we save the image, we can mark blank pages by setting
the least significant bit of .orig_address to 1 in the corresponding PBEs.
We save the "marked" .orig_address values to the image.

Then, when we are about to save the page, we check the least
significant bit of its .orig_address, and save it only if this bit is zero.

When we are about to load a page, we first get a _zeroed_ page for it.
Next, we check if its .orig_address has the least significant bit set.
If not, we load the page, and otherwise we only clear that bit
(the page is already zero).

> and list of cached
> pages in memory is going to be usefull for other stuff, too.
>
> Imagine:
>
> cat /dev/give-me-list-of-pages-in-page-cache > /tmp/delme.suspend
> echo disk > /sys/power/state
> nice ( cat /tmp/delme.suspend | read-those-pages-back ) &
>
> Result is quite obviously safe (unless you mess up locking while
> dumping pagecache), and it is going to be rather easy to test. Just
> load the system as much as you can while doing
>
> while true; do cat /dev/give-me-list-of-pages-in-page-cache >
> /dev/null; done
>
> . Still, limiting image size to 500MB is probably easier solution. I'm
> looking forward to that page.

This is in the works.

Greetings,
Rafael


--
Beer is proof that God loves us and wants us to be happy - Benjamin Franklin

Pavel Machek

Dec 7, 2005, 6:40:06 AM
Hi!

> > > > I'm suggesting that rather than writing the clean pages out to the
> > > > image, simply make their metadata available to a post-resume userland
> > > > helper. Something like
> > > >
> > > > % head -2 /dev/swsusp-helper
> > > > /bin/sh 105-115 192 199-259
> > > > /lib/libc-2.3.2.so 1-250
> > > >
> > > > where the userland program is expected to use the list of page numbers
> > > > (and getpagesize(2)) to asynchronously page in the working set in an
> > > > ionice'd manner.
> > >
> > > The helper is not necessary, I think.
> >
> > Actually, I like the helper. It is safest solution,
>
> No, it's not.
>
> Let me explain what I have in mind.
>
> For starters, please observe that the addresses we use are page-aligned,
> so the least significant bit is always zero. Thus it can be used as a marker.
>
> Now before we save the image we can mark blank pages by setting
> the least significant bit of .orig_address to 1 in the coresponding PBEs.
> We save the "marked" .orig_address values to the image.

Well, a nice optimization, but how many pages are actually full of
zeros? The above has the advantage of working with any "clean" pages -- like
the text pages of /bin/bash etc. And if done right it will not be
intrusive...

Pavel
--
Thanks, Sharp!

Rafael J. Wysocki

Dec 7, 2005, 7:00:38 AM
Hi,

On Tuesday, 6 December 2005 00:55, Pavel Machek wrote:
}-- snip --{
> > OTOH, we can get similar result by just making the kernel free some
> > more memory _after_ we are sure we have enough memory to suspend.
> > IOW, after the code that's currently in swsusp_shrink_memory() has finished,
> > we can try to free some "extra" memory to improve performance, if
> > needed. The question is how much "extra" memory should be freed and
> > I'm afraid it will have to be tuned on the per-system, or at least
> > per-RAM-size, basis.
>
> I'd prefer not to have extra tunables. "Write only 500MB" will work
> okay for common desktop users -- as long as common desktop fits into
> 500MB, that is. "Free not used in last 10 minutes" should work okay
> for everyone, but may be slightly harder to implement.

Still, it can be done with a fairly small patch that has an additional
advantage, as it allows us to get rid of the FAST_FREE constant
which I don't like. Appended (untested).

Greetings,
Rafael


kernel/power/power.h | 8 +++-----
kernel/power/swsusp.c | 16 ++++++++--------
2 files changed, 11 insertions(+), 13 deletions(-)

Index: linux-2.6.15-rc5-mm1/kernel/power/power.h
===================================================================
--- linux-2.6.15-rc5-mm1.orig/kernel/power/power.h 2005-12-05 22:07:12.000000000 +0100
+++ linux-2.6.15-rc5-mm1/kernel/power/power.h 2005-12-07 12:45:04.000000000 +0100
@@ -53,12 +53,10 @@
extern struct pbe *pagedir_nosave;

/*
- * This compilation switch determines the way in which memory will be freed
- * during suspend. If defined, only as much memory will be freed as needed
- * to complete the suspend, which will make it go faster. Otherwise, the
- * largest possible amount of memory will be freed.
+ * Preferred image size in MB (set it to zero to get the smallest
+ * image possible)
*/
-#define FAST_FREE 1
+#define IMAGE_SIZE 500

extern asmlinkage int swsusp_arch_suspend(void);
extern asmlinkage int swsusp_arch_resume(void);
Index: linux-2.6.15-rc5-mm1/kernel/power/swsusp.c
===================================================================
--- linux-2.6.15-rc5-mm1.orig/kernel/power/swsusp.c 2005-12-05 22:07:12.000000000 +0100
+++ linux-2.6.15-rc5-mm1/kernel/power/swsusp.c 2005-12-07 12:40:27.000000000 +0100
@@ -626,6 +626,7 @@

int swsusp_shrink_memory(void)
{
+ unsigned long size;
long tmp;
struct zone *zone;
unsigned long pages = 0;
@@ -634,11 +635,11 @@

printk("Shrinking memory... ");
do {
-#ifdef FAST_FREE
- tmp = 2 * count_highmem_pages();
- tmp += tmp / 50 + count_data_pages();
- tmp += (tmp + PBES_PER_PAGE - 1) / PBES_PER_PAGE +
+ size = 2 * count_highmem_pages();
+ size += size / 50 + count_data_pages();
+ size += (size + PBES_PER_PAGE - 1) / PBES_PER_PAGE +
PAGES_FOR_IO;
+ tmp = size;
for_each_zone (zone)
if (!is_highmem(zone))
tmp -= zone->free_pages;
@@ -647,11 +648,10 @@
if (!tmp)
return -ENOMEM;
pages += tmp;
+ } else if (size > (IMAGE_SIZE * 1024 * 1024) / PAGE_SIZE) {
+ tmp = shrink_all_memory(SHRINK_BITE);
+ pages += tmp;
}
-#else
- tmp = shrink_all_memory(SHRINK_BITE);
- pages += tmp;
-#endif
printk("\b%c", p[i++%4]);
} while (tmp > 0);
printk("\bdone (%lu pages freed)\n", pages);


--
Beer is proof that God loves us and wants us to be happy - Benjamin Franklin

Pavel Machek

Dec 7, 2005, 7:10:19 AM
Hi!

> > > OTOH, we can get similar result by just making the kernel free some
> > > more memory _after_ we are sure we have enough memory to suspend.
> > > IOW, after the code that's currently in swsusp_shrink_memory() has finished,
> > > we can try to free some "extra" memory to improve performance, if
> > > needed. The question is how much "extra" memory should be freed and
> > > I'm afraid it will have to be tuned on the per-system, or at least
> > > per-RAM-size, basis.
> >
> > I'd prefer not to have extra tunables. "Write only 500MB" will work
> > okay for common desktop users -- as long as common desktop fits into
> > 500MB, that is. "Free not used in last 10 minutes" should work okay
> > for everyone, but may be slightly harder to implement.
>
> Still, it can be done with a fairly small patch that has an additional
> advantage, as it allows us to get rid of the FAST_FREE constant
> which I don't like. Appended (untested).

Looks good to me.

> Index: linux-2.6.15-rc5-mm1/kernel/power/swsusp.c
> ===================================================================
> --- linux-2.6.15-rc5-mm1.orig/kernel/power/swsusp.c 2005-12-05 22:07:12.000000000 +0100
> +++ linux-2.6.15-rc5-mm1/kernel/power/swsusp.c 2005-12-07 12:40:27.000000000 +0100
> @@ -626,6 +626,7 @@
>
> int swsusp_shrink_memory(void)
> {
> + unsigned long size;
> long tmp;

Perhaps both should be long, or both unsigned long?

Pavel
--
Thanks, Sharp!

Rafael J. Wysocki

Dec 7, 2005, 7:20:13 AM
Hi,

tmp has to be signed. Both can be long, though.

Should I test it and post for merging?

Rafael


--
Beer is proof that God loves us and wants us to be happy - Benjamin Franklin

Pavel Machek

Dec 7, 2005, 7:30:20 AM
Hi!

> > Looks good to me.
> >
> > > Index: linux-2.6.15-rc5-mm1/kernel/power/swsusp.c
> > > ===================================================================
> > > --- linux-2.6.15-rc5-mm1.orig/kernel/power/swsusp.c 2005-12-05 22:07:12.000000000 +0100
> > > +++ linux-2.6.15-rc5-mm1/kernel/power/swsusp.c 2005-12-07 12:40:27.000000000 +0100
> > > @@ -626,6 +626,7 @@
> > >
> > > int swsusp_shrink_memory(void)
> > > {
> > > + unsigned long size;
> > > long tmp;
> >
> > Perhaps both should be long, or both unsigned long?
>
> tmp has to be signed. Both can be long, though.
>
> Should I test it and post for merging?

Yes..
Pavel

--
Thanks, Sharp!

Nigel Cunningham

Dec 7, 2005, 5:10:07 PM
Hi.

On Wed, 2005-12-07 at 00:22, Pavel Machek wrote:
> Hi!
>
> > Hi. Tue, 2005-12-06 at 12:06, Andy Isaacson wrote:
> > > Could we rework it to avoid writing clean pages out to the swsusp image,
> > > but keep a list of those pages and read them back in *after* having
> > > resumed? Maybe do the /dev/initrd ('less +/once Documentation/initrd.txt'
> > > if you're not familiar with it) trick to make the list of pages available
> > > to a userland helper.
> >
> > The problem is that once you let userspace run, you have absolutely no
> > control over what pages are read from or written to, and if a userspace
> > app assumes that data is there in a page when it isn't, you have a
> > recipe for an oops at best, and possibly for on disk
> > corruption. Pages
>
> No, that will not be a problem. You just resume system as you do now,
> most pages will be not there. *But kernel knows it is not there*, and
> will on-demand load them back. It will be normal userland application
> doing readback. There's absolutely no risk of corruption.

How does the kernel know the pages aren't there? I thought for a while
yesterday that I'd just misread something, but as I look at this again
this morning, I'm not so sure. For what you're talking about to work,
you'd need to mess with the page tables so that the kernel doesn't think
those pages are still there.

I can understand how you'd remember what pages to fault in, but getting
the kernel to know they're not there sounds like a rewrite of kswapd.

Regards,

Nigel

> Imagine something that saves list of needed pages before suspend, then
> does something like
>
> cat `cat /proc/[0-9]*/maps | grep / | sed 's:.* /:/:' | sort -u` > /dev/null
>
> ...it should work pretty well. And worst thing it can do is send your
> system thrashing.
>
> Pavel
--

Pavel Machek

Dec 7, 2005, 5:30:07 PM
Hi!

> > > > Could we rework it to avoid writing clean pages out to the swsusp image,
> > > > but keep a list of those pages and read them back in *after* having
> > > > resumed? Maybe do the /dev/initrd ('less +/once Documentation/initrd.txt'
> > > > if you're not familiar with it) trick to make the list of pages available
> > > > to a userland helper.
> > >
> > > The problem is that once you let userspace run, you have absolutely no
> > > control over what pages are read from or written to, and if a userspace
> > > app assumes that data is there in a page when it isn't, you have a
> > > recipe for an oops at best, and possibly for on disk
> > > corruption. Pages
> >
> > No, that will not be a problem. You just resume system as you do now,
> > most pages will be not there. *But kernel knows it is not there*, and
> > will on-demand load them back. It will be normal userland application
> > doing readback. There's absolutely no risk of corruption.
>
> How does the kernel know the pages aren't there? I thought for a
> while

We evict them, as we normally do.

Literally:

cat /dev/pagecache-contents > /tmp/state
echo disk > /sys/power/state # Does shrink all memory, as usual, so
# pages are gone, and kernel knows
# they are gone. We are using normal
# memory management code.
( cat `cat /tmp/state` > /dev/null ) &

Pavel
--
Thanks, Sharp!

Rafael J. Wysocki

Dec 8, 2005, 5:50:10 PM
Hi,

On Wednesday, 7 December 2005 12:30, Pavel Machek wrote:
}-- snip --{

> > Let me explain what I have in mind.
> >
> > For starters, please observe that the addresses we use are page-aligned,
> > so the least significant bit is always zero. Thus it can be used as a marker.
> >
> > Now before we save the image we can mark blank pages by setting
> > the least significant bit of .orig_address to 1 in the coresponding PBEs.
> > We save the "marked" .orig_address values to the image.
>
> Well, nice optimalization, but how many pages are actually full of
> zeros?

According to the results I have obtained, there are about 1000 such
pages in the image on my box, for total image sizes between 28000
and 80000 pages (i.e. above 28000 pages in the image, the number
of blank pages is almost constant).

Greetings,
Rafael


--
Beer is proof that God loves us and wants us to be happy - Benjamin Franklin

Pavel Machek

Dec 8, 2005, 6:00:18 PM
Hi!

> > > Let me explain what I have in mind.
> > >
> > > For starters, please observe that the addresses we use are page-aligned,
> > > so the least significant bit is always zero. Thus it can be used as a marker.
> > >
> > > Now before we save the image we can mark blank pages by setting
> > > the least significant bit of .orig_address to 1 in the coresponding PBEs.
> > > We save the "marked" .orig_address values to the image.
> >
> > Well, nice optimalization, but how many pages are actually full of
> > zeros?
>
> According to the results I have obtained, there are about 1000 such
> pages in the image on my box, for total image sizes between 28000
> and 80000 pages (ie above 28000 of pages in the image the number
> of blank pages is almost constant).

4MB of zeros. I'd say that we have bigger problems to
solve. Detecting zero pages is actually a very trivial form of
compression, and I think that we might as well do it properly and use
something like LZW -- if we want to go that way. I think that should
be a userspace problem.
Pavel

--
Thanks, Sharp!

Andrew Morton

Dec 10, 2005, 5:30:21 PM
"Rafael J. Wysocki" <r...@sisk.pl> wrote:
>
> Till now, we have used the simplistic approach
> based on freeing as much memory as possible before suspend. Now, we
> are freeing only as much memory as necessary, which is on the other
> end of the scale, so to speak.

You might want to play with
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.15-rc5/2.6.15-rc5-mm1/broken-out/drop-pagecache.patch.
That's a fast-and-easy way of freeing up quite a lot of memory.

Rafael J. Wysocki

Dec 10, 2005, 6:10:09 PM
On Saturday, 10 December 2005 23:21, Andrew Morton wrote:
> "Rafael J. Wysocki" <r...@sisk.pl> wrote:
> >
> > Till now, we have used the simplistic approach
> > based on freeing as much memory as possible before suspend. Now, we
> > are freeing only as much memory as necessary, which is on the other
> > end of the scale, so to speak.
>
> You might want to play with
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.15-rc5/2.6.15-rc5-mm1/broken-out/drop-pagecache.patch.
> That's a fast-and-easy way of freeing up quite a lot of memory.

Thanks a lot for this hint. :-)

Would that be ok if I made drop_pagecache() nonstatic and called it directly
from swsusp?

Andrew Morton

Dec 10, 2005, 6:40:14 PM
"Rafael J. Wysocki" <r...@sisk.pl> wrote:
>
> Would that be ok if I made drop_pagecache() nonstatic and called it directly
> from swsusp?

Sure, I'll update the patch for that.

It changed a bit. You'll need to run sys_sync(), then drop_pagecache(),
then drop_slab().


From: Andrew Morton <ak...@osdl.org>

Add /proc/sys/vm/drop_caches. When written to, this will cause the kernel
to discard as much pagecache and/or reclaimable slab objects as it can.

It won't drop dirty data, so the user should run `sync' first.

Caveats:

a) Holds inode_lock for exorbitant amounts of time.

b) Needs to be taught about NUMA nodes: propagate these all the way through
so the discarding can be controlled on a per-node basis.


Signed-off-by: Andrew Morton <ak...@osdl.org>
---

Documentation/filesystems/proc.txt | 17 +++++++++
Documentation/sysctl/vm.txt | 3 +
fs/Makefile | 2 -
fs/drop_caches.c | 68 +++++++++++++++++++++++++++++++++++++
include/linux/mm.h | 7 +++
include/linux/sysctl.h | 1
kernel/sysctl.c | 10 +++++
mm/truncate.c | 1
mm/vmscan.c | 3 -
9 files changed, 107 insertions(+), 5 deletions(-)

diff -puN /dev/null fs/drop_caches.c
--- /dev/null 2003-09-15 06:40:47.000000000 -0700
+++ devel-akpm/fs/drop_caches.c 2005-12-10 15:31:19.000000000 -0800
@@ -0,0 +1,68 @@
+/*
+ * Implement the manual drop-all-pagecache function
+ */
+
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/writeback.h>
+#include <linux/sysctl.h>
+#include <linux/gfp.h>
+
+/* A global variable is a bit ugly, but it keeps the code simple */
+int sysctl_drop_caches;
+
+static void drop_pagecache_sb(struct super_block *sb)
+{
+ struct inode *inode;
+
+ spin_lock(&inode_lock);
+ list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ if (inode->i_state & (I_FREEING|I_WILL_FREE))
+ continue;
+ invalidate_inode_pages(inode->i_mapping);
+ }
+ spin_unlock(&inode_lock);
+}
+
+void drop_pagecache(void)
+{
+ struct super_block *sb;
+
+ spin_lock(&sb_lock);
+restart:
+ list_for_each_entry(sb, &super_blocks, s_list) {
+ sb->s_count++;
+ spin_unlock(&sb_lock);
+ down_read(&sb->s_umount);
+ if (sb->s_root)
+ drop_pagecache_sb(sb);
+ up_read(&sb->s_umount);
+ spin_lock(&sb_lock);
+ if (__put_super_and_need_restart(sb))
+ goto restart;
+ }
+ spin_unlock(&sb_lock);
+}
+
+void drop_slab(void)
+{
+ int nr_objects;
+
+ do {
+ nr_objects = shrink_slab(1000, GFP_KERNEL, 1000);
+ } while (nr_objects > 10);
+}
+
+int drop_caches_sysctl_handler(ctl_table *table, int write,
+ struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+ proc_dointvec_minmax(table, write, file, buffer, length, ppos);
+ if (write) {
+ if (sysctl_drop_caches & 1)
+ drop_pagecache();
+ if (sysctl_drop_caches & 2)
+ drop_slab();
+ }
+ return 0;
+}
diff -puN fs/Makefile~drop-pagecache fs/Makefile
--- devel/fs/Makefile~drop-pagecache 2005-12-10 15:30:17.000000000 -0800
+++ devel-akpm/fs/Makefile 2005-12-10 15:30:17.000000000 -0800
@@ -10,7 +10,7 @@ obj-y := open.o read_write.o file_table.
ioctl.o readdir.o select.o fifo.o locks.o dcache.o inode.o \
attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
seq_file.o xattr.o libfs.o fs-writeback.o mpage.o direct-io.o \
- ioprio.o pnode.o
+ ioprio.o pnode.o drop_caches.o

obj-$(CONFIG_INOTIFY) += inotify.o
obj-$(CONFIG_EPOLL) += eventpoll.o
diff -puN include/linux/mm.h~drop-pagecache include/linux/mm.h
--- devel/include/linux/mm.h~drop-pagecache 2005-12-10 15:30:17.000000000 -0800
+++ devel-akpm/include/linux/mm.h 2005-12-10 15:31:35.000000000 -0800
@@ -1010,5 +1010,12 @@ int in_gate_area_no_task(unsigned long a
/* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
#define OOM_DISABLE -17

+int drop_caches_sysctl_handler(struct ctl_table *, int, struct file *,
+ void __user *, size_t *, loff_t *);
+int shrink_slab(unsigned long scanned, gfp_t gfp_mask,
+ unsigned long lru_pages);
+void drop_pagecache(void);
+void drop_slab(void);
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff -puN include/linux/sysctl.h~drop-pagecache include/linux/sysctl.h
--- devel/include/linux/sysctl.h~drop-pagecache 2005-12-10 15:30:17.000000000 -0800
+++ devel-akpm/include/linux/sysctl.h 2005-12-10 15:30:17.000000000 -0800
@@ -180,6 +180,7 @@ enum
VM_VFS_CACHE_PRESSURE=26, /* dcache/icache reclaim pressure */
VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
+ VM_DROP_PAGECACHE=29, /* int: nuke lots of pagecache */
};


diff -puN kernel/sysctl.c~drop-pagecache kernel/sysctl.c
--- devel/kernel/sysctl.c~drop-pagecache 2005-12-10 15:30:17.000000000 -0800
+++ devel-akpm/kernel/sysctl.c 2005-12-10 15:30:17.000000000 -0800
@@ -68,6 +68,7 @@ extern int min_free_kbytes;
extern int printk_ratelimit_jiffies;
extern int printk_ratelimit_burst;
extern int pid_max_min, pid_max_max;
+extern int sysctl_drop_caches;

#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
int unknown_nmi_panic;
@@ -775,6 +776,15 @@ static ctl_table vm_table[] = {
.strategy = &sysctl_intvec,
},
{
+ .ctl_name = VM_DROP_PAGECACHE,
+ .procname = "drop_caches",
+ .data = &sysctl_drop_caches,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = drop_caches_sysctl_handler,
+ .strategy = &sysctl_intvec,
+ },
+ {
.ctl_name = VM_MIN_FREE_KBYTES,
.procname = "min_free_kbytes",
.data = &min_free_kbytes,
diff -puN mm/truncate.c~drop-pagecache mm/truncate.c
--- devel/mm/truncate.c~drop-pagecache 2005-12-10 15:30:17.000000000 -0800
+++ devel-akpm/mm/truncate.c 2005-12-10 15:30:17.000000000 -0800
@@ -249,7 +249,6 @@ unlock:
break;
}
pagevec_release(&pvec);
- cond_resched();
}
return ret;
}
diff -puN mm/vmscan.c~drop-pagecache mm/vmscan.c
--- devel/mm/vmscan.c~drop-pagecache 2005-12-10 15:30:17.000000000 -0800
+++ devel-akpm/mm/vmscan.c 2005-12-10 15:30:17.000000000 -0800
@@ -183,8 +183,7 @@ EXPORT_SYMBOL(remove_shrinker);
*
* Returns the number of slab objects which we shrunk.
*/
-static int shrink_slab(unsigned long scanned, gfp_t gfp_mask,
- unsigned long lru_pages)
+int shrink_slab(unsigned long scanned, gfp_t gfp_mask, unsigned long lru_pages)
{
struct shrinker *shrinker;
int ret = 0;
diff -puN Documentation/sysctl/vm.txt~drop-pagecache Documentation/sysctl/vm.txt
--- devel/Documentation/sysctl/vm.txt~drop-pagecache 2005-12-10 15:30:17.000000000 -0800
+++ devel-akpm/Documentation/sysctl/vm.txt 2005-12-10 15:30:17.000000000 -0800
@@ -26,12 +26,13 @@ Currently, these files are in /proc/sys/
- min_free_kbytes
- laptop_mode
- block_dump
+- drop-caches

==============================================================

dirty_ratio, dirty_background_ratio, dirty_expire_centisecs,
dirty_writeback_centisecs, vfs_cache_pressure, laptop_mode,
-block_dump, swap_token_timeout:
+block_dump, swap_token_timeout, drop-caches:

See Documentation/filesystems/proc.txt

diff -puN Documentation/filesystems/proc.txt~drop-pagecache Documentation/filesystems/proc.txt
--- devel/Documentation/filesystems/proc.txt~drop-pagecache 2005-12-10 15:30:17.000000000 -0800
+++ devel-akpm/Documentation/filesystems/proc.txt 2005-12-10 15:30:17.000000000 -0800
@@ -1302,6 +1302,23 @@ VM has token based thrashing control mec
unnecessary page faults in thrashing situation. The unit of the value is
second. The value would be useful to tune thrashing behavior.

+drop_caches
+-----------
+
+Writing to this will cause the kernel to drop clean caches, dentries and
+inodes from memory, causing that memory to become free.
+
+To free caches:
+ echo 1 > /proc/sys/vm/drop_caches
+To free dentries and inodes:
+ echo 2 > /proc/sys/vm/drop_caches
+To free caches, dentries and inodes:
+ echo 3 > /proc/sys/vm/drop_caches
+
+As this is a non-destructive operation and dirty objects are not freeable, the
+user should run `sync' first.
+
+
2.5 /proc/sys/dev - Device specific parameters
----------------------------------------------

_

Rafael J. Wysocki

Dec 11, 2005, 7:20:12 AM
On Sunday, 11 December 2005 00:33, Andrew Morton wrote:
> "Rafael J. Wysocki" <r...@sisk.pl> wrote:
> >
> > Would that be ok if I made drop_pagecache() nonstatic and called it directly
> > from swsusp?
>
> Sure, I'll updates the patch for that.

Thanks a lot.



> It changed a bit.. You'll need to run sys_sync() then drop_pagecache()
> then drop_slab().

I think it won't hurt if we do this unconditionally in swsusp_shrink_memory().
Pavel, what do you think?

The appended patch illustrates the way in which I think we can use this.
I've tested it a little, but if someone feels like trying it, please do.

Greetings,
Rafael


kernel/power/swsusp.c | 3 +++
1 files changed, 3 insertions(+)

Index: linux-2.6.15-rc5-mm1/kernel/power/swsusp.c
===================================================================
--- linux-2.6.15-rc5-mm1.orig/kernel/power/swsusp.c 2005-12-10 23:51:00.000000000 +0100
+++ linux-2.6.15-rc5-mm1/kernel/power/swsusp.c 2005-12-11 12:45:57.000000000 +0100
@@ -641,6 +641,9 @@ int swsusp_shrink_memory(void)
char *p = "-\\|/";

printk("Shrinking memory... ");
+ sys_sync();
+ drop_pagecache();
+ drop_slab();
do {
size = 2 * count_highmem_pages();
size += size / 50 + count_data_pages();

Pavel Machek

Dec 11, 2005, 6:30:13 PM
Hi!

> > > Would that be ok if I made drop_pagecache() nonstatic and called it directly
> > > from swsusp?
> >
> > Sure, I'll updates the patch for that.
>
> Thanks a lot.
>
> > It changed a bit.. You'll need to run sys_sync() then drop_pagecache()
> > then drop_slab().
>
> I think it won't hurt if we do this unconditionally in swsusp_shrink_memory().
> Pavel, what do you think?
>
> The appended patch illustrates the way in which I think we can use this.
> I've tested it a little, but if someone feels like trying it, please
> do.

Not sure, do we really want to drop all the pagecache? We want to free
memory that is not going to be used soon after suspend, but I guess
pagecache can be quite "hot".

I'd certainly wait with this until the code settles, and at least try
whether it helps or hurts...
Pavel

--
Thanks, Sharp!

Rafael J. Wysocki

Dec 12, 2005, 12:50:08 PM
Hi,

On Monday, 12 December 2005 00:28, Pavel Machek wrote:
> > > > Would that be ok if I made drop_pagecache() nonstatic and called it directly
> > > > from swsusp?
> > >
> > > Sure, I'll updates the patch for that.
> >
> > Thanks a lot.
> >
> > > It changed a bit.. You'll need to run sys_sync() then drop_pagecache()
> > > then drop_slab().
> >
> > I think it won't hurt if we do this unconditionally in swsusp_shrink_memory().
> > Pavel, what do you think?
> >
> > The appended patch illustrates the way in which I think we can use this.
> > I've tested it a little, but if someone feels like trying it, please
> > do.
>
> Not sure, do we really want to drop all the pagecache? We want to free
> memory that is not going to be used soon after suspend, but I guess
> pagecache can be quite "hot".
>
> I'd certainly wait with this until code settles. And at least trying
> if it helps or hurts...

Sure. Today it caused my box to get stuck in the idle process. ;-) OTOH,
performance-wise it does not seem to hurt.

Greetings,
Rafael


--
Beer is proof that God loves us and wants us to be happy - Benjamin Franklin
