[Patch 0/7] Implement crashkernel=auto

Amerigo Wang

Aug 5, 2009, 7:30:16 AM

This patch series implements automatic memory reservation for the crashkernel
by introducing a new boot option, "crashkernel=auto". The idea is from Neil.

To avoid breaking user-space applications, the kernel rewrites this boot option
once it has decided how much memory to reserve.

The threshold and the reserved memory size differ between architectures. Please
refer to patch 7/7, which contains an update to the documentation.

Note: This patchset was only tested on x86_64 with different memory sizes.
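
As a rough illustration of the per-architecture policy described above, here is a minimal user-space sketch. The names `struct arch_policy` and `default_crash_size` are made up for this example and do not appear in the series; the numbers mirror the ones used in the patches (128M on x86, 256M on powerpc, both with a 4G threshold).

```c
/* Hypothetical per-arch policy table; not part of the patch series. */
struct arch_policy {
	const char *name;
	unsigned long long threshold;	/* minimum RAM for auto-reserve */
	unsigned long long reserved;	/* amount reserved at/above threshold */
};

static const struct arch_policy x86_policy = { "x86", 1ULL << 32, 1ULL << 27 };
static const struct arch_policy ppc_policy = { "powerpc", 1ULL << 32, 1ULL << 28 };

/* Return the size to reserve, or 0 if RAM is below the threshold. */
static unsigned long long default_crash_size(const struct arch_policy *p,
					     unsigned long long total_ram)
{
	return total_ram < p->threshold ? 0 : p->reserved;
}
```

With this sketch, `default_crash_size(&x86_policy, total_ram)` yields 0 on a machine with less than 4G of RAM, which corresponds to the case where "crashkernel=auto" reserves nothing.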

Signed-off-by: WANG Cong <amw...@redhat.com>
Cc: Neil Horman <nho...@redhat.com>
Cc: Ingo Molnar <mi...@elte.hu>
Cc: Eric W. Biederman <ebie...@xmission.com>
Cc: Tony Luck <tony...@intel.com>
Cc: Anton Vorontsov <avoro...@ru.mvista.com>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Amerigo Wang

Aug 5, 2009, 7:30:15 AM

Introduce a new config option KEXEC_AUTO_RESERVE for x86.

Signed-off-by: WANG Cong <amw...@redhat.com>

---
Index: linux-2.6/arch/x86/Kconfig
===================================================================
--- linux-2.6.orig/arch/x86/Kconfig
+++ linux-2.6/arch/x86/Kconfig
@@ -1482,6 +1482,16 @@ config KEXEC
support. As of this writing the exact hardware interface is
strongly in flux, so no good recommendation can be made.

+config KEXEC_AUTO_RESERVE
+ bool "automatically reserve memory for kexec kernel"
+ depends on KEXEC
+ default y
+ ---help---
+ Automatically reserve memory for a kexec kernel, so that you don't
+ need to specify numbers for the "crashkernel=X@Y" boot option,
+ instead you can use "crashkernel=auto".
+ On x86, 128M is reserved.
+
config CRASH_DUMP
bool "kernel crash dumps"
depends on X86_64 || (X86_32 && HIGHMEM)

Amerigo Wang

Aug 5, 2009, 7:30:14 AM

Introduce a new config option KEXEC_AUTO_RESERVE for powerpc.

Signed-off-by: WANG Cong <amw...@redhat.com>

---
Index: linux-2.6/arch/powerpc/Kconfig
===================================================================
--- linux-2.6.orig/arch/powerpc/Kconfig
+++ linux-2.6/arch/powerpc/Kconfig
@@ -346,6 +346,16 @@ config KEXEC
support. As of this writing the exact hardware interface is
strongly in flux, so no good recommendation can be made.

+config KEXEC_AUTO_RESERVE
+ bool "automatically reserve memory for kexec kernel"
+ depends on KEXEC
+ default y
+ ---help---
+ Automatically reserve memory for a kexec kernel, so that you don't
+ need to specify numbers for the "crashkernel=X@Y" boot option,
+ instead you can use "crashkernel=auto".
+ On PPC, 256M is reserved and only when you have memory > 4G.
+
config CRASH_DUMP
bool "Build a kdump crash kernel"
depends on PPC64 || 6xx

Amerigo Wang

Aug 5, 2009, 7:30:13 AM

Since patch 2/7 already implements the generic part, this patch adds the
remaining powerpc-specific part.

Signed-off-by: WANG Cong <amw...@redhat.com>

---
Index: linux-2.6/arch/powerpc/include/asm/kexec.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/kexec.h
+++ linux-2.6/arch/powerpc/include/asm/kexec.h
@@ -39,6 +39,29 @@ typedef void (*crash_shutdown_t)(void);

#ifdef CONFIG_KEXEC

+#ifdef CONFIG_KEXEC_AUTO_RESERVE
+#ifndef KEXEC_AUTO_RESERVED_SIZE
+#define KEXEC_AUTO_RESERVED_SIZE (1ULL<<28) /* 256M */
+#endif
+#ifndef KEXEC_AUTO_THRESHOLD
+#define KEXEC_AUTO_THRESHOLD (1ULL<<32) /* 4G */
+#endif
+static inline
+unsigned long long arch_default_crash_size(unsigned long long total_size)
+{
+ if (total_size < KEXEC_AUTO_THRESHOLD)
+ return 0;
+ else
+ return KEXEC_AUTO_RESERVED_SIZE;
+}
+static inline
+unsigned long long arch_default_crash_base(void)
+{
+ /* On ppc, 0 means find the base address automatically. */
+ return 0;
+}
+#endif
+
/*
* This function is responsible for capturing register states if coming
* via panic or invoking dump using sysrq-trigger.

Amerigo Wang

Aug 5, 2009, 7:30:16 AM

Update the kdump documentation.

Signed-off-by: WANG Cong <amw...@redhat.com>

---
Index: linux-2.6/Documentation/kdump/kdump.txt
===================================================================
--- linux-2.6.orig/Documentation/kdump/kdump.txt
+++ linux-2.6/Documentation/kdump/kdump.txt
@@ -147,6 +147,15 @@ System kernel config options
analysis tools require a vmlinux with debug symbols in order to read
and analyze a dump file.

+4) Enable "automatically reserve memory for kexec kernel" in
+ "Processor type and features."
+
+ CONFIG_KEXEC_AUTO_RESERVE=y
+
+ This will let you use "crashkernel=auto" instead of specifying
+ numbers for "crashkernel=". Note that you need to have enough memory.
+ The threshold and reserved memory size are arch-dependent.
+
Dump-capture kernel config options (Arch Independent)
-----------------------------------------------------

@@ -266,6 +275,13 @@ This would mean:
2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
3) if the RAM size is larger than 2G, then reserve 128M

+Or you can use:
+
+ crashkernel=auto
+
+if you have enough memory. The threshold is 4G, below which this won't work.
+Also the automatically reserved memory size would be 128M on x86, 256M on
+other platforms that have KEXEC.

Boot into System Kernel

Amerigo Wang

Aug 5, 2009, 7:30:17 AM

Implement "crashkernel=auto" for x86 first; other architectures will be added
in the following patches.

Signed-off-by: WANG Cong <amw...@redhat.com>

---
Index: linux-2.6/kernel/kexec.c
===================================================================
--- linux-2.6.orig/kernel/kexec.c
+++ linux-2.6/kernel/kexec.c
@@ -37,6 +37,7 @@
#include <asm/io.h>
#include <asm/system.h>
#include <asm/sections.h>
+#include <asm/setup.h>

/* Per cpu memory for storing cpu states in case of system crash. */
note_buf_t* crash_notes;
@@ -1297,6 +1298,38 @@ int __init parse_crashkernel(char *cm

ck_cmdline += 12; /* strlen("crashkernel=") */

+#ifdef CONFIG_KEXEC_AUTO_RESERVE
+ if (strncmp(ck_cmdline, "auto", 4) == 0) {
+ unsigned long long size;
+ char tmp[32];
+
+ size = arch_default_crash_size(system_ram);
+ if (size != 0) {
+ *crash_size = size;
+ *crash_base = arch_default_crash_base();
+ size = scnprintf(tmp, sizeof(tmp), "%luM@%luM",
+ (unsigned long)(*crash_size)>>20,
+ (unsigned long)(*crash_base)>>20);
+ /* size can't be <= 4. */
+ if (likely((size - 4 + strlen(cmdline))
+ < COMMAND_LINE_SIZE - 1)) {
+ memmove(ck_cmdline + size, ck_cmdline + 4,
+ strlen(cmdline) - (ck_cmdline + 4 - cmdline) + 1);
+ memcpy(ck_cmdline, tmp, size);
+ }
+ return 0;
+ } else {
+ /*
+ * We can't reserve memory automatically,
+ * remove "crashkernel=auto" from cmdline.
+ */
+ ck_cmdline += 4; /* strlen("auto") */
+ memmove(ck_cmdline - 16, ck_cmdline,
+ strlen(cmdline) - (ck_cmdline - cmdline) + 1);
+ return -ENOMEM;
+ }
+ }
+#endif
/*
* if the commandline contains a ':', then that's the extended
* syntax -- if not, it must be the classic syntax
Index: linux-2.6/arch/x86/include/asm/kexec.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/kexec.h
+++ linux-2.6/arch/x86/include/asm/kexec.h
@@ -61,6 +61,29 @@
# define KEXEC_ARCH KEXEC_ARCH_X86_64
#endif

+#ifdef CONFIG_KEXEC_AUTO_RESERVE
+#ifndef KEXEC_AUTO_RESERVED_SIZE
+#define KEXEC_AUTO_RESERVED_SIZE (1ULL<<27) /* 128M */
+#endif
+#ifndef KEXEC_AUTO_THRESHOLD
+#define KEXEC_AUTO_THRESHOLD (1ULL<<32) /* 4G */
+#endif
+static inline
+unsigned long long arch_default_crash_size(unsigned long long total_size)
+{
+ if (total_size < KEXEC_AUTO_THRESHOLD)
+ return 0;
+ else
+ return KEXEC_AUTO_RESERVED_SIZE;
+}
+static inline
+unsigned long long arch_default_crash_base(void)
+{
+ /* On x86, 0 means find the base address automatically. */
+ return 0;
+}
+#endif
+
/*
* CPU does not save ss and sp on stack if execution is already
* running in kernel mode at the time of NMI occurrence. This code

Eric W. Biederman

Aug 5, 2009, 9:40:14 AM

Amerigo Wang <amw...@redhat.com> writes:

> This series of patch implements automatically reserved memory for crashkernel,
> by introducing a new boot option "crashkernel=auto". This idea is from Neil.
>
> In case of breaking user-space applications, it modifies this boot option after
> it decides how much memory should be reserved.
>
> On different arch, the threshold and reserved memory size is different. Please
> refer patch 7/7 which contains an update for the documentation.
>
> Note: This patchset was only tested on x86_64 with differernt memory sizes.

This seems like a silly hard code. Especially for a feature distros don't
care enough about to implement a working initrd for.

Has anyone bothered to justify those large amounts of memory?
Where does the 128M go?

Please pardon me for being a cynic but I don't see the command line option
being the bottleneck for real users to make this work.

Eric

Neil Horman

Aug 5, 2009, 9:50:12 AM

On Wed, Aug 05, 2009 at 07:19:12AM -0400, Amerigo Wang wrote:
>
> Introduce a new config option KEXEC_AUTO_RESERVE for x86.
>
> Signed-off-by: WANG Cong <amw...@redhat.com>
>
> ---
> Index: linux-2.6/arch/x86/Kconfig
> ===================================================================
> --- linux-2.6.orig/arch/x86/Kconfig
> +++ linux-2.6/arch/x86/Kconfig
> @@ -1482,6 +1482,16 @@ config KEXEC
> support. As of this writing the exact hardware interface is
> strongly in flux, so no good recommendation can be made.
>
> +config KEXEC_AUTO_RESERVE
> + bool "automatically reserve memory for kexec kernel"
> + depends on KEXEC
> + default y
> + ---help---
> + Automatically reserve memory for a kexec kernel, so that you don't
> + need to specify numbers for the "crashkernel=X@Y" boot option,
> + instead you can use "crashkernel=auto".
> + On x86, 128M is reserved.
> +
> config CRASH_DUMP
> bool "kernel crash dumps"
> depends on X86_64 || (X86_32 && HIGHMEM)

Acked-by: Neil Horman <nho...@tuxdriver.com>

Neil Horman

Aug 5, 2009, 9:50:17 AM

What about all the other arches that support kexec? ia64/ppc[64]/s390/etc?
Don't they need an implementation of arch_default_crash_size? Or perhaps better
still you should put this definition in asm-generic, so that it can be
overridden per-arch if need be, but you always have something to fall back on.
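
One way to picture the fallback suggested here is the following sketch. It is not taken from the series: the `ARCH_SKETCH_PPC` switch is a hypothetical stand-in for "the arch header was included first", and the generic default (128M above a 4G threshold) is an assumption borrowed from the x86 patch. The `#define name name` marker is a common kernel idiom for letting a generic header detect an arch override.

```c
/*
 * Arch-specific part. In a real tree this would live in the arch's
 * asm/kexec.h; ARCH_SKETCH_PPC is a hypothetical switch for this sketch.
 */
#ifdef ARCH_SKETCH_PPC
static inline unsigned long long
arch_default_crash_size(unsigned long long total_size)
{
	/* 256M once RAM reaches 4G, as in the powerpc patch. */
	return total_size < (1ULL << 32) ? 0 : (1ULL << 28);
}
#define arch_default_crash_size arch_default_crash_size
#endif

/* Generic part: only compiled when the arch did not provide its own. */
#ifndef arch_default_crash_size
static inline unsigned long long
arch_default_crash_size(unsigned long long total_size)
{
	/* Assumed generic default: 128M once RAM reaches 4G. */
	return total_size < (1ULL << 32) ? 0 : (1ULL << 27);
}
#define arch_default_crash_size arch_default_crash_size
#endif
```

Built without `-DARCH_SKETCH_PPC`, callers get the generic version; built with it, the powerpc numbers win and the generic definition is skipped entirely.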

Neil

Neil Horman

Aug 5, 2009, 10:00:16 AM

Same comment here, this looks like it really belongs in asm-generic to me.
Neil

Neil Horman

Aug 5, 2009, 10:00:15 AM

On Wed, Aug 05, 2009 at 07:19:51AM -0400, Amerigo Wang wrote:
>
> Introduce a new config option KEXEC_AUTO_RESERVE for powerpc.
>
> Signed-off-by: WANG Cong <amw...@redhat.com>
>
> ---
> Index: linux-2.6/arch/powerpc/Kconfig
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/Kconfig
> +++ linux-2.6/arch/powerpc/Kconfig
> @@ -346,6 +346,16 @@ config KEXEC
> support. As of this writing the exact hardware interface is
> strongly in flux, so no good recommendation can be made.
>
> +config KEXEC_AUTO_RESERVE
> + bool "automatically reserve memory for kexec kernel"
> + depends on KEXEC
> + default y
> + ---help---
> + Automatically reserve memory for a kexec kernel, so that you don't
> + need to specify numbers for the "crashkernel=X@Y" boot option,
> + instead you can use "crashkernel=auto".
> + On PPC, 256M is reserved and only when you have memory > 4G.
> +
> config CRASH_DUMP
> bool "Build a kdump crash kernel"
> depends on PPC64 || 6xx

Acked-by: Neil Horman <nho...@tuxdriver.com>

Neil Horman

Aug 5, 2009, 10:10:12 AM

On Wed, Aug 05, 2009 at 06:33:57AM -0700, Eric W. Biederman wrote:
> Amerigo Wang <amw...@redhat.com> writes:
>
> > This series of patch implements automatically reserved memory for crashkernel,
> > by introducing a new boot option "crashkernel=auto". This idea is from Neil.
> >
> > In case of breaking user-space applications, it modifies this boot option after
> > it decides how much memory should be reserved.
> >
> > On different arch, the threshold and reserved memory size is different. Please
> > refer patch 7/7 which contains an update for the documentation.
> >
> > Note: This patchset was only tested on x86_64 with differernt memory sizes.
>
> This seems like a silly hard code. Especially for a feature distros don't
> care enough about to implement a working initrd for.
>
> Has anyone bothered to justify those large amounts of memory?
> Where does the 128M go?
>
> Please pardon me for being a cynic but I don't see the command line option
> being the bottleneck for real users to make this work.
>
> Eric

Lots of the impetus behind this results from a desire to have kexec configured
and set up during install. Having the kernel allocate a default-size block of
RAM lets you do that without the need for an interim reboot. You could of
course boot the installer kernel with a crashkernel line pre-selected, I suppose,
but then you have to go to the trouble of figuring that allocation size out each
time. This gives you a nice convenient way to get a reasonable block of memory
without the need to do all that extra work.
Neil

Andi Kleen

Aug 5, 2009, 10:50:16 AM

Amerigo Wang <amw...@redhat.com> writes:

> Introduce a new config option KEXEC_AUTO_RESERVE for x86.

The description of the feature belongs in the changelog.

I like the basic idea, but:

> +config KEXEC_AUTO_RESERVE
> + bool "automatically reserve memory for kexec kernel"
> + depends on KEXEC
> + default y
> + ---help---
> + Automatically reserve memory for a kexec kernel, so that you don't
> + need to specify numbers for the "crashkernel=X@Y" boot option,
> + instead you can use "crashkernel=auto".
> + On x86, 128M is reserved.

The obvious problem is the hardcoded 128MB (and 128MB is very large
for a crash kernel anyways)

More useful would seem a crashkernel=size@auto

-Andi

--
a...@linux.intel.com -- Speaking for myself only.

Eric W. Biederman

Aug 5, 2009, 4:10:11 PM

Andi Kleen <an...@firstfloor.org> writes:

> Amerigo Wang <amw...@redhat.com> writes:
>
>> Introduce a new config option KEXEC_AUTO_RESERVE for x86.
>
> The description of the feature belongs in the changelog.
>
> I like the basic idea, but:
>
>> +config KEXEC_AUTO_RESERVE
>> + bool "automatically reserve memory for kexec kernel"
>> + depends on KEXEC
>> + default y
>> + ---help---
>> + Automatically reserve memory for a kexec kernel, so that you don't
>> + need to specify numbers for the "crashkernel=X@Y" boot option,
>> + instead you can use "crashkernel=auto".
>> + On x86, 128M is reserved.
>
> The obvious problem is the hardcoded 128MB (and 128MB is very large
> for a crash kernel anyways)
>
> More useful would seem a crashkernel=size@auto

That is actually called "crashkernel=size" and we have had that for quite
a while. Although some of the init scripts have problems.

Eric

Eric W. Biederman

Aug 5, 2009, 7:00:09 PM

Neil Horman <nho...@redhat.com> writes:

> On Wed, Aug 05, 2009 at 06:33:57AM -0700, Eric W. Biederman wrote:
>> Amerigo Wang <amw...@redhat.com> writes:
>>
>> > This series of patch implements automatically reserved memory for crashkernel,
>> > by introducing a new boot option "crashkernel=auto". This idea is from Neil.
>> >
>> > In case of breaking user-space applications, it modifies this boot option after
>> > it decides how much memory should be reserved.
>> >
>> > On different arch, the threshold and reserved memory size is different. Please
>> > refer patch 7/7 which contains an update for the documentation.
>> >
>> > Note: This patchset was only tested on x86_64 with differernt memory sizes.
>>
>> This seems like a silly hard code. Especially for a feature distros don't
>> care enough about to implement a working initrd for.
>>
>> Has anyone bothered to justify those large amounts of memory?
>> Where does the 128M go?
>>
>> Please pardon me for being a cynic but I don't see the command line option
>> being the bottleneck for real users to make this work.
>>
>> Eric
>
> Lots of the impetus behind this results from a desire to have kexec configured
> and setup up during install. Having the kernel allocate a default size block of
> RAM lets you do that without the need for an interim reboot.

I assume you mean kexec on panic. kexec should be fine and you can arguably
solve this problem with a little bit of userspace glue and a kexec of yourself
during bootup.

> You could of
> course boot the installer kernel with a crashkernel line pre-selected suppose,
> but then you have to go to the trouble of figuring that allocation size out each
> time. This gives you a nice convienent way to get a reasonable block of memory
> without the need to do all that extra work.

My big concern is that you are moving policy into the kernel, when it isn't at
all clear that policy is the right thing to do, and the existing mechanisms give
you enough rope to do this all in userspace.

You also have to build (or at least load) the whole kdump image after
the system boots, and configure someplace for this to be saved.

What class of problems do you expect to catch with this?

What has me puzzled is that the mkdumprd that ships with fedora isn't
usable without patching, and it seems to be steadily getting worse. If the
concern was about getting better bug reports I would expect getting this
functionality into fedora would be where you would be focusing your efforts.

Am I missing something obvious?

Eric

Yu, Fenghua

Aug 5, 2009, 7:00:13 PM

>+#ifdef CONFIG_KEXEC_AUTO_RESERVE
>+ if (strncmp(ck_cmdline, "auto", 4) == 0) {
>+ unsigned long long size;
>+ char tmp[32];
>+
>+ size = arch_default_crash_size(system_ram);
>+ if (size != 0) {
>+ *crash_size = size;
>+ *crash_base = arch_default_crash_base();
>+ size = scnprintf(tmp, sizeof(tmp), "%luM@%luM",
>+ (unsigned long)(*crash_size)>>20,
>+ (unsigned long)(*crash_base)>>20);
>+ /* size can't be <= 4. */
>+ if (likely((size - 4 + strlen(cmdline))
>+ < COMMAND_LINE_SIZE - 1)) {
>+ memmove(ck_cmdline + size, ck_cmdline + 4,
>+ strlen(cmdline) - (ck_cmdline + 4 - cmdline) + 1);
>+ memcpy(ck_cmdline, tmp, size);
>+ }

Here the variable "size" has two different meanings. First it is used for the crash size. Then its meaning changes to the length of the string written into ck_cmdline. The types are different too: the crash size is unsigned long long, but scnprintf() returns int.

Could you use two variables to represent the two meanings for less confusion?
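
For what it is worth, the split could look roughly like the user-space sketch below. It is only an illustration, not the patch itself: `rewrite_auto` and `CMDLINE_MAX` are hypothetical names, `CMDLINE_MAX` stands in for COMMAND_LINE_SIZE, and snprintf() is used in place of the kernel's scnprintf(). The memory size and the string length live in separate variables with separate types.

```c
#include <stdio.h>
#include <string.h>

#define CMDLINE_MAX 256	/* hypothetical stand-in for COMMAND_LINE_SIZE */

/*
 * Replace the "auto" that ck_cmdline points at (just past "crashkernel=")
 * with "<size>M@<base>M". Returns 0 on success, -1 if it would not fit.
 */
static int rewrite_auto(char *cmdline, char *ck_cmdline,
			unsigned long long crash_size,
			unsigned long long crash_base)
{
	char tmp[32];
	int len;	/* length of the replacement text, not a memory size */
	size_t tail;

	len = snprintf(tmp, sizeof(tmp), "%lluM@%lluM",
		       crash_size >> 20, crash_base >> 20);
	if (len < 0 || (size_t)len >= sizeof(tmp))
		return -1;
	/* New string length must leave room for the terminating NUL. */
	if (strlen(cmdline) - 4 + (size_t)len > CMDLINE_MAX - 1)
		return -1;

	/* Bytes after "auto", including the terminating NUL. */
	tail = strlen(ck_cmdline + 4) + 1;
	memmove(ck_cmdline + len, ck_cmdline + 4, tail);
	memcpy(ck_cmdline, tmp, (size_t)len);
	return 0;
}
```

The caller is assumed to pass a buffer of at least `CMDLINE_MAX` bytes and to have checked that `ck_cmdline` really starts with "auto".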

Thanks.

-Fenghua

Amerigo Wang

Aug 5, 2009, 9:40:07 PM

Eric W. Biederman wrote:
> Amerigo Wang <amw...@redhat.com> writes:
>
>
>> This series of patch implements automatically reserved memory for crashkernel,
>> by introducing a new boot option "crashkernel=auto". This idea is from Neil.
>>
>> In case of breaking user-space applications, it modifies this boot option after
>> it decides how much memory should be reserved.
>>
>> On different arch, the threshold and reserved memory size is different. Please
>> refer patch 7/7 which contains an update for the documentation.
>>
>> Note: This patchset was only tested on x86_64 with differernt memory sizes.
>>
>
> This seems like a silly hard code. Especially for a feature distros don't
> care enough about to implement a working initrd for.
>
> Has anyone bothered to justify those large amounts of memory?
> Where does the 128M go?
>

If 128M is too big, we can make it 64M; that is no problem.
I am very open to this. :)

> Please pardon me for being a cynic but I don't see the command line option
> being the bottleneck for real users to make this work.
>
>

Well, take /me as an example: to be honest, I still have no idea how
much memory I should reserve for s390/sh. If I wanted to use kdump on sh,
it *would be* my bottleneck.

Thanks.

Amerigo Wang

Aug 5, 2009, 9:50:06 PM

Neil Horman wrote:
> What about all the other arches that support kexec? ia64/ppc[64]/s390/etc?
> Don't they need an implementation of arch_default_crash_size? Or perhaps better
> still you should put this definition in asm-generic, so that it can be
> overridden per-arch if need be, but you always have something to fall back on.
>

Hmm, I think you mean we need ARCH_HAS_XXXX....
OK, it is a good idea, I will try it.

Thanks for your review, Neil.

Amerigo Wang

Aug 5, 2009, 10:00:10 PM

Andi Kleen wrote:
> Amerigo Wang <amw...@redhat.com> writes:
>
>
>> Introduce a new config option KEXEC_AUTO_RESERVE for x86.
>>
>
> The description of the feature belongs in the changelog.
>
> I like the basic idea, but:
>
>
>> +config KEXEC_AUTO_RESERVE
>> + bool "automatically reserve memory for kexec kernel"
>> + depends on KEXEC
>> + default y
>> + ---help---
>> + Automatically reserve memory for a kexec kernel, so that you don't
>> + need to specify numbers for the "crashkernel=X@Y" boot option,
>> + instead you can use "crashkernel=auto".
>> + On x86, 128M is reserved.
>>
>
> The obvious problem is the hardcoded 128MB (and 128MB is very large
> for a crash kernel anyways)
>

I think that size has to be hardcoded, or we can make it somewhat
adjustable according to the page size. E.g. on PPC and IA64 the page
size can be 16K or more, but x86's page size is always 4K, I think.

Hmm, yes, I chose such a large size in order to be safe, but since you
feel this is too large, how about 64M on x86? (On x86_64 Fedora and
RHEL, the size of a kernel binary is about 2M~3M.)

> More useful would seem a crashkernel=size@auto
>

We already have this, just use "crashkernel=size@0". :)
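
To make the "size@base" form concrete, here is a small user-space sketch of how such an argument can be parsed. It is not the kernel's parser: `parse_mem` and `parse_simple` are hypothetical names for this example, and `parse_mem` only imitates the K/M/G suffix handling that the kernel's memparse() provides. A base of 0, whether written explicitly or left out, is treated as "find the base address automatically", as the comments in patch 2/7 describe for x86.

```c
#include <stdlib.h>

/*
 * Parse a number with an optional K/M/G suffix.
 * Sets *end to the first character that was not consumed.
 */
static unsigned long long parse_mem(const char *s, char **end)
{
	unsigned long long val = strtoull(s, end, 0);

	switch (**end) {
	case 'G': case 'g':
		val <<= 10;
		/* fall through */
	case 'M': case 'm':
		val <<= 10;
		/* fall through */
	case 'K': case 'k':
		val <<= 10;
		(*end)++;
		break;
	default:
		break;
	}
	return val;
}

/*
 * Parse "size[@base]". A missing "@base" leaves *base at 0.
 * Returns 0 on success, -1 on malformed input.
 */
static int parse_simple(const char *arg, unsigned long long *size,
			unsigned long long *base)
{
	char *cur;

	*base = 0;
	*size = parse_mem(arg, &cur);
	if (cur == arg)
		return -1;	/* no digits at all, e.g. "auto" */
	if (*cur == '@') {
		const char *start = cur + 1;

		*base = parse_mem(start, &cur);
		if (cur == start)
			return -1;	/* '@' with nothing after it */
	}
	return (*cur == '\0' || *cur == ' ') ? 0 : -1;
}
```

Under these assumptions, "64M" and "64M@0" both come out as a 64M reservation with base 0, while "128M@16M" pins the region at 16M.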

Thanks.

Amerigo Wang

Aug 5, 2009, 10:00:10 PM

Yu, Fenghua wrote:
>> +#ifdef CONFIG_KEXEC_AUTO_RESERVE
>> + if (strncmp(ck_cmdline, "auto", 4) == 0) {
>> + unsigned long long size;
>> + char tmp[32];
>> +
>> + size = arch_default_crash_size(system_ram);
>> + if (size != 0) {
>> + *crash_size = size;
>> + *crash_base = arch_default_crash_base();
>> + size = scnprintf(tmp, sizeof(tmp), "%luM@%luM",
>> + (unsigned long)(*crash_size)>>20,
>> + (unsigned long)(*crash_base)>>20);
>> + /* size can't be <= 4. */
>> + if (likely((size - 4 + strlen(cmdline))
>> + < COMMAND_LINE_SIZE - 1)) {
>> + memmove(ck_cmdline + size, ck_cmdline + 4,
>> + strlen(cmdline) - (ck_cmdline + 4 - cmdline) + 1);
>> + memcpy(ck_cmdline, tmp, size);
>> + }
>>
>
> Here the variable "size" has two different meanings. First it used for crash size. Then its meaning is changed to buffer size in ck_cmdline. And types are different too. The type for crash size is unsigned long long. But scnprintf() return int.
>
> Could you use two variables to represent the two meanings for less confusion?
>

Sure, OK, will do.

Thanks.

Amerigo Wang

Aug 5, 2009, 10:10:07 PM

Eric W. Biederman wrote:

> Neil Horman <nho...@redhat.com> writes:
>
>> You could of
>> course boot the installer kernel with a crashkernel line pre-selected suppose,
>> but then you have to go to the trouble of figuring that allocation size out each
>> time. This gives you a nice convienent way to get a reasonable block of memory
>> without the need to do all that extra work.
>>
>
> My big concern is that you are moving policy into the kernel, when it isn't at
> all clear that policy is the right thing to do, and the existing mechanisms give
> you enough rope to do this all in userspace.
>

How? This doesn't remove the existing mechanism; it just provides a new
choice for users like me who don't know how much memory should be
reserved, or who simply don't want to worry about it because they have
plenty of memory.

> You also have to build (or at least load) the whole kdump image after
> the system boots, and configure someplace for this to be saved.
>
> What class of problems do you expect to catch with this?
>

Again, try to save the user from choosing numbers for "crashkernel=".

> What has me puzzled is that the mkdumprd that ships with fedora isn't
> usable without patching, and it seems to be steadily getting worse.

Please explain why it is not usable. The patch won't break
userspace, since it rewrites the "crashkernel=" command line option dynamically.

Thanks.

Eric W. Biederman

Aug 5, 2009, 10:50:10 PM

Amerigo Wang <amw...@redhat.com> writes:

> Eric W. Biederman wrote:
>> Neil Horman <nho...@redhat.com> writes:
>>
>>> You could of
>>> course boot the installer kernel with a crashkernel line pre-selected suppose,
>>> but then you have to go to the trouble of figuring that allocation size out each
>>> time. This gives you a nice convienent way to get a reasonable block of memory
>>> without the need to do all that extra work.
>>>
>>
>> My big concern is that you are moving policy into the kernel, when it isn't at
>> all clear that policy is the right thing to do, and the existing mechanisms give
>> you enough rope to do this all in userspace.
>>
>
>
> How? This doesn't remove the existing mechanism, just provides a new choice for
> user like me who doesn't know how much memory should be reserved, or who simply
> doesn't want to concern this since he/she has very enough memory.

Having end users not caring is fine. My problem with this patch is that it
appears that no one is stepping up and taking responsibility for the crashdump
mechanism, ensuring it will work, ensuring it will work for users. I see
us solving problems in the wrong place because people are not stepping
up and solving them in the proper place.

To actually use it there are much bigger problems than supplying the size
of the crashdump area.

My personal experience is that to make this work I had to list every kernel
module I needed in the order that they must be loaded, in /etc/kdump.conf
for my filesystem. I had to modify mkdumprd so that it actually generated
a dump ramdisk, and I think I am forgetting some other manual hack that
I had to use as well.

So if you have to do something manually I think the problem is user space,
and not the kernel.

In general I figure that whoever builds the kernel and initrd should be
responsible for testing and figuring out the amount of memory needed.
The primary kernel has no idea what is going to be loaded in there and
as such no real idea how much memory is needed.

>> You also have to build (or at least load) the whole kdump image after
>> the system boots, and configure someplace for this to be saved.
>>
>> What class of problems do you expect to catch with this?
>>
>
> Again, try to save the user from choosing numbers for "crashkernel=".

The user being kernel developers? Whoever builds the kernel and initrd
should be responsible for testing and figuring this out.

In a distro context installers etc. should be able to set up good defaults
so end users don't have to worry about this.

>> What has me puzzled is that the mkdumprd that ships with fedora isn't
>> usable without patching, and it seems to be steadily getting worse.
>
> Please explain why it is not usable? The patch won't break the userspace, since
> it modifies the "crashkernel=" command line dynamically.

No, the crashdump mechanism is useless because user space is already
broken and unusable. At least on Fedora and, I assume, by extension
Red Hat.

Eric

Amerigo Wang

Aug 5, 2009, 11:40:08 PM

Eric W. Biederman wrote:
> In general I figure that whoever builds the kernel and initrd should be
> responsible for testing and figuring out the amount of memory needed.
> The primary kernel has no idea what is going to loaded in there and
> as such no real idea how much memory is needed.
>

Yeah, that is exactly why I _didn't_ pick the idea of reserving memory
automatically and silently without "crashkernel=auto".

If a user specifies "crashkernel=auto", that means he/she has no idea
how much memory should be reserved and wants to let the kernel
decide. The kernel should know better than the user in this situation.

>>> You also have to build (or at least load) the whole kdump image after
>>> the system boots, and configure someplace for this to be saved.
>>>
>>> What class of problems do you expect to catch with this?
>>>
>>>
>> Again, try to save the user from choosing numbers for "crashkernel=".
>>
>
> The user being kernel developers? Whoever builds the kernel and initrd
> should be responsible for testing and figuring this out.
>
> In a distro context installers etc should be able to setup good defaults
> so end users don't have to worry about this.
>
>

For kernel developers, "crashkernel=auto" should save a lot of effort. You
seem to agree with this one.

Users rely on the distro, which can now always specify
"crashkernel=auto" rather than different numbers for different
architectures, since "crashkernel=auto" is designed to be safe for all
cases. This also saves a lot of work...

>>> What has me puzzled is that the mkdumprd that ships with fedora isn't
>>> usable without patching, and it seems to be steadily getting worse.
>>>
>> Please explain why it is not usable? The patch won't break the userspace, since
>> it modifies the "crashkernel=" command line dynamically.
>>
>
> No the crashdump mechanism is useless because user space is already
> broken and unusable.

Again, why broken?

Thanks.

Eric W. Biederman

Aug 6, 2009, 12:00:16 AM

Amerigo Wang <amw...@redhat.com> writes:

>> No the crashdump mechanism is useless because user space is already
>> broken and unusable.
>
> Again, why broken?

To get a stock SATA drive working by hand I had to list about 5 kernel
modules in the right magic order in /etc/kdump.conf.

Neither mount by label nor mount by uuid worked when specified in
/etc/kdump.conf. I had to hack mkdumprd to get an initrd that even finds
the proper disk to mount.

Short version: it takes a huge amount of expertise to get what ships with
Fedora to pass the trivial alt-sysrq-c test. It would probably be about
as easy to write your own custom initrd by hand.

Eric

Amerigo Wang

Aug 6, 2009, 2:00:16 AM

Eric W. Biederman wrote:
> Amerigo Wang <amw...@redhat.com> writes:
>
>
>>> No the crashdump mechanism is useless because user space is already
>>> broken and unusable.
>>>
>> Again, why broken?
>>
>
> To get a stock stat drive by hand I had to list about 5 kernel modules
> in the right magic order in /etc/kdump.conf
>
> Neither mount by label or mount by uuid when specified in /etc/kdump.conf
> I had to hack mkdumprd to get an initrd that even finds the proper disk
> to mount.
>

You are saying that it is difficult to make an initrd for kdump,
but I am sorry that I can't see any relation between this and my patch.
What is your point here?

Eric W. Biederman

Aug 6, 2009, 2:20:09 AM

Amerigo Wang <amw...@redhat.com> writes:

> Eric W. Biederman wrote:
>> Amerigo Wang <amw...@redhat.com> writes:
>>
>>
>>>> No the crashdump mechanism is useless because user space is already
>>>> broken and unusable.
>>>>
>>> Again, why broken?
>>>
>>
>> To get a stock stat drive by hand I had to list about 5 kernel modules
>> in the right magic order in /etc/kdump.conf
>>
>> Neither mount by label or mount by uuid when specified in /etc/kdump.conf
>> I had to hack mkdumprd to get an initrd that even finds the proper disk
>> to mount.
>>
>
> You are saying that there is some difficulty to make a initrd for kdump, but I
> am sorry that I can't see any relations between this and my patch. What is your
> point here?

You are trying to make it easier for end users.

I am saying the problem is in user space.

I am also saying that the kernel doesn't have a clue what you are
going to load with kexec on panic to handle panics. Maybe it is a
custom standalone binary that only needs 5K. So the kernel doesn't
have a clue what the right size to reserve is.

I think if what you were proposing was part of some coherent story for
a complete implementation I would consider it more. Instead this just
appears to be a reaction to how frustrating the user space
implementation is, and fixing things in the kernel instead of in user
space.

The fact that user space is broken to the point of unusability on fedora
simply reinforces the point to me that the problem is there not in the
kernel. So I am pushing back and saying get your user space act together
and then this kernel option won't matter.

I am further saying that this selecting how much memory to use is the least
of your problems.

Eric

Amerigo Wang

unread,

Aug 6, 2009, 2:30:13 AM8/6/09

to

Introduce a new config option KEXEC_AUTO_RESERVE for x86.

Signed-off-by: WANG Cong <amw...@redhat.com>
Acked-by: Neil Horman <nho...@tuxdriver.com>

---
Index: linux-2.6/arch/x86/Kconfig
===================================================================
--- linux-2.6.orig/arch/x86/Kconfig
+++ linux-2.6/arch/x86/Kconfig
@@ -1482,6 +1482,16 @@ config KEXEC
support. As of this writing the exact hardware interface is
strongly in flux, so no good recommendation can be made.

+config KEXEC_AUTO_RESERVE
+ bool "automatically reserve memory for kexec kernel"
+ depends on KEXEC
+ default y
+ ---help---
+ Automatically reserve memory for a kexec kernel, so that you don't
+ need to specify numbers for the "crashkernel=X@Y" boot option,
+ instead you can use "crashkernel=auto".
+ On x86, 128M is reserved and only when you have memory > 4G.
+
config CRASH_DUMP
bool "kernel crash dumps"
depends on X86_64 || (X86_32 && HIGHMEM)

Amerigo Wang

unread,

Aug 6, 2009, 2:30:17 AM8/6/09

to

Introduce a new config option KEXEC_AUTO_RESERVE for powerpc.

Signed-off-by: WANG Cong <amw...@redhat.com>
Acked-by: Neil Horman <nho...@tuxdriver.com>

---
Index: linux-2.6/arch/powerpc/Kconfig
===================================================================
--- linux-2.6.orig/arch/powerpc/Kconfig
+++ linux-2.6/arch/powerpc/Kconfig
@@ -346,6 +346,16 @@ config KEXEC
support. As of this writing the exact hardware interface is
strongly in flux, so no good recommendation can be made.

+config KEXEC_AUTO_RESERVE
+ bool "automatically reserve memory for kexec kernel"
+ depends on KEXEC
+ default y
+ ---help---
+ Automatically reserve memory for a kexec kernel, so that you don't
+ need to specify numbers for the "crashkernel=X@Y" boot option,
+ instead you can use "crashkernel=auto".
+ On PPC, 256M is reserved and only when you have memory > 4G.
+
config CRASH_DUMP
bool "Build a kdump crash kernel"
depends on PPC64 || 6xx

Amerigo Wang

unread,

Aug 6, 2009, 2:30:20 AM8/6/09

to

Implement "crashkernel=auto" for x86 first, other arch will be added in the
following patches.

(Is 128M too big to be reserved on x86?)

Signed-off-by: WANG Cong <amw...@redhat.com>

---

Index: linux-2.6/kernel/kexec.c
===================================================================
--- linux-2.6.orig/kernel/kexec.c
+++ linux-2.6/kernel/kexec.c
@@ -37,6 +37,7 @@
#include <asm/io.h>
#include <asm/system.h>
#include <asm/sections.h>
+#include <asm/setup.h>

/* Per cpu memory for storing cpu states in case of system crash. */
note_buf_t* crash_notes;

@@ -1297,6 +1298,39 @@ int __init parse_crashkernel(char *cm

ck_cmdline += 12; /* strlen("crashkernel=") */

+#ifdef CONFIG_KEXEC_AUTO_RESERVE
+ if (strncmp(ck_cmdline, "auto", 4) == 0) {
+ unsigned long long size;
+ int len;
+ char tmp[32];
+
+ size = arch_default_crash_size(system_ram);
+ if (size != 0) {
+ *crash_size = size;
+ *crash_base = arch_default_crash_base();
+ len = scnprintf(tmp, sizeof(tmp), "%luM@%luM",
+ (unsigned long)(*crash_size)>>20,
+ (unsigned long)(*crash_base)>>20);
+ /* 'len' can't be <= 4. */
+ if (likely((len - 4 + strlen(cmdline))
+ < COMMAND_LINE_SIZE - 1)) {
+ memmove(ck_cmdline + len, ck_cmdline + 4,
+ strlen(cmdline) - (ck_cmdline + 4 - cmdline) + 1);
+ memcpy(ck_cmdline, tmp, len);
+ }
+ return 0;
+ } else {
+ /*
+ * We can't reserve memory automatically,
+ * remove "crashkernel=auto" from cmdline.
+ */
+ ck_cmdline += 4; /* strlen("auto") */
+ memmove(ck_cmdline - 16, ck_cmdline,
+ strlen(cmdline) - (ck_cmdline - cmdline) + 1);
+ return -ENOMEM;
+ }
+ }
+#endif
/*
* if the commandline contains a ':', then that's the extended
* syntax -- if not, it must be the classic syntax
Index: linux-2.6/arch/x86/include/asm/kexec.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/kexec.h
+++ linux-2.6/arch/x86/include/asm/kexec.h

@@ -23,6 +23,7 @@

#include <asm/page.h>
#include <asm/ptrace.h>
+#include <asm-generic/kexec.h>

/*
* KEXEC_SOURCE_MEMORY_LIMIT maximum page get_free_page can return.
Index: linux-2.6/include/asm-generic/kexec.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-generic/kexec.h
@@ -0,0 +1,27 @@
+#ifndef _ASM_GENERIC_KEXEC_H
+#define _ASM_GENERIC_KEXEC_H
+
+#ifdef CONFIG_KEXEC_AUTO_RESERVE
+#ifndef KEXEC_AUTO_RESERVED_SIZE
+#define KEXEC_AUTO_RESERVED_SIZE (1ULL<<27) /* 128M */
+#endif
+#ifndef KEXEC_AUTO_THRESHOLD
+#define KEXEC_AUTO_THRESHOLD (1ULL<<32) /* 4G */
+#endif
+static inline
+unsigned long long arch_default_crash_size(unsigned long long total_size)
+{
+ if (total_size < KEXEC_AUTO_THRESHOLD)
+ return 0;
+ else
+ return KEXEC_AUTO_RESERVED_SIZE;
+}
+static inline
+unsigned long long arch_default_crash_base(void)
+{
+ /* 0 means find the base address automatically. */
+ return 0;
+}
+#endif /* CONFIG_KEXEC_AUTO_RESERVE */
+
+#endif
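
The in-place rewrite of "crashkernel=auto" in the hunk above can be tried outside the kernel. The following is a minimal user-space sketch, not the patch's code: the helper name rewrite_crashkernel_auto() and its bufsize parameter are assumptions made for illustration, and snprintf() stands in for the kernel's scnprintf(). It replaces the four bytes of "auto" with "<size>M@<base>M" and refuses to do so if the result would not fit in the buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Replace the "auto" in "crashkernel=auto" with "<size>M@<base>M".
 * Hypothetical user-space helper; returns 0 on success, -1 if the option
 * is absent or the rewritten line would not fit in bufsize bytes.
 */
static int rewrite_crashkernel_auto(char *cmdline, size_t bufsize,
                                    unsigned long long size,
                                    unsigned long long base)
{
    char tmp[48];
    char *opt, *val;
    size_t len, tail;

    assert(cmdline != NULL);

    opt = strstr(cmdline, "crashkernel=auto");
    if (opt == NULL)
        return -1;
    val = opt + strlen("crashkernel=");

    len = (size_t)snprintf(tmp, sizeof(tmp), "%lluM@%lluM",
                           size >> 20, base >> 20);

    /* "auto" (4 bytes) is replaced by len bytes; check the new total. */
    if (strlen(cmdline) - 4 + len + 1 > bufsize)
        return -1;

    tail = strlen(val + 4) + 1;         /* rest of the line plus NUL */
    memmove(val + len, val + 4, tail);  /* source and destination overlap */
    memcpy(val, tmp, len);
    return 0;
}
```

Shifting the tail with memmove() before copying the new value matters because the two regions overlap whenever the replacement is longer than "auto". The sketch matches the option with a plain strstr(), so it does not check that "auto" is a complete token.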

Amerigo Wang

unread,

Aug 6, 2009, 2:30:19 AM8/6/09

to

Since patch 2/7 already implements the generic part, this patch adds
the remaining part for powerpc.

Signed-off-by: WANG Cong <amw...@redhat.com>

---

Index: linux-2.6/arch/powerpc/include/asm/kexec.h

===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/kexec.h
+++ linux-2.6/arch/powerpc/include/asm/kexec.h

@@ -39,6 +39,14 @@ typedef void (*crash_shutdown_t)(void);

#ifdef CONFIG_KEXEC

+#ifdef CONFIG_KEXEC_AUTO_RESERVE
+#ifdef KEXEC_AUTO_RESERVED_SIZE
+#undef KEXEC_AUTO_RESERVED_SIZE
+#endif
+#define KEXEC_AUTO_RESERVED_SIZE (1ULL<<28) /* 256M */
+#include <asm-generic/kexec.h>
+#endif
+
/*
* This function is responsible for capturing register states if coming
* via panic or invoking dump using sysrq-trigger.

Amerigo Wang

unread,

Aug 6, 2009, 2:30:16 AM8/6/09

to

Introduce a new config option KEXEC_AUTO_RESERVE for ia64.

Signed-off-by: WANG Cong <amw...@redhat.com>
Acked-by: Neil Horman <nho...@tuxdriver.com>

---
Index: linux-2.6/arch/ia64/Kconfig
===================================================================
--- linux-2.6.orig/arch/ia64/Kconfig
+++ linux-2.6/arch/ia64/Kconfig
@@ -582,6 +582,16 @@ config KEXEC
support. As of this writing the exact hardware interface is
strongly in flux, so no good recommendation can be made.

+config KEXEC_AUTO_RESERVE
+ bool "automatically reserve memory for kexec kernel"
+ depends on KEXEC
+ default y
+ ---help---
+ Automatically reserve memory for a kexec kernel, so that you don't
+ need to specify numbers for the "crashkernel=X@Y" boot option,
+ instead you can use "crashkernel=auto".
+ On IA64, 256M is reserved and only when you have memory > 4G.
+
config CRASH_DUMP
bool "kernel crash dumps"
depends on IA64_MCA_RECOVERY && !IA64_HP_SIM && (!SMP || HOTPLUG_CPU)

Amerigo Wang

unread,

Aug 6, 2009, 2:30:19 AM8/6/09

to

Since patch 2/7 already implements the generic part, this patch adds
the remaining part for ia64.

Signed-off-by: WANG Cong <amw...@redhat.com>

---

Index: linux-2.6/arch/ia64/include/asm/kexec.h
===================================================================
--- linux-2.6.orig/arch/ia64/include/asm/kexec.h
+++ linux-2.6/arch/ia64/include/asm/kexec.h
@@ -19,6 +19,14 @@
flush_icache_range(page_addr, page_addr + PAGE_SIZE); \
} while(0)

+#ifdef CONFIG_KEXEC_AUTO_RESERVE
+#ifdef KEXEC_AUTO_RESERVED_SIZE
+#undef KEXEC_AUTO_RESERVED_SIZE
+#endif
+#define KEXEC_AUTO_RESERVED_SIZE (1ULL<<28) /* 256M */
+#include <asm-generic/kexec.h>
+#endif
+

extern struct kimage *ia64_kimage;
extern const unsigned int relocate_new_kernel_size;
extern void relocate_new_kernel(unsigned long, unsigned long,

Amerigo Wang

unread,

Aug 6, 2009, 2:40:09 AM8/6/09

to

Eric W. Biederman wrote:
> Amerigo Wang <amw...@redhat.com> writes:
>
>
>> Eric W. Biederman wrote:
>>
>>> Amerigo Wang <amw...@redhat.com> writes:
>>>
>>>
>>>
>>>>> No the crashdump mechanism is useless because user space is already
>>>>> broken and unusable.
>>>>>
>>>>>
>>>> Again, why broken?
>>>>
>>>>
>>> To get a stock stat drive by hand I had to list about 5 kernel modules
>>> in the right magic order in /etc/kdump.conf
>>>
>>> Neither mount by label or mount by uuid when specified in /etc/kdump.conf
>>> I had to hack mkdumprd to get an initrd that even finds the proper disk
>>> to mount.
>>>
>>>
>> You are saying that there is some difficulty to make a initrd for kdump, but I
>> am sorry that I can't see any relations between this and my patch. What is your
>> point here?
>>
>
> You are trying to make it easier for end users.
>
> I am saying the problem is in user space.
>
> I am saying also that the kernel doesn't have a clue what you are
> going to load with kexec on panic to handle panics. Maybe it is a
> custom stand alone binary that only needs 5K. So the kernel doesn't
> have a clue what the right size to reserve.
>

So what? If you have 8G of memory, would you mind 128M-5K of it being wasted?

The kernel doesn't have to reserve the exact amount of memory that a
kexec kernel will use; it just picks a size big enough for all cases,
which already assumes that physical memory is large enough.

> I think if what you were proposing was part of some coherent story for
> a complete implementation I would consider it more. Instead this just
> appears to be a reaction to how frustrating the user space
> implementation is, and fixing things in the kernel instead of in user
> space.
>

Yes, exactly. In fact, I am working on another part which will allow us
to take back the reserved memory at run-time.

Andi Kleen

unread,

Aug 6, 2009, 3:20:05 AM8/6/09

to

>> More useful would seem a crashkernel=size@auto
>>
> We already have this, just use "crashkernel=size@0". :)

When it's already there then I don't see the point of the feature at all.

Hardcoding the size doesn't really make any sense to me, especially
a suspicious one like 128MB.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.

Amerigo Wang

unread,

Aug 6, 2009, 3:50:08 AM8/6/09

to

Andi Kleen wrote:
>>> More useful would seem a crashkernel=size@auto
>>>
>>>
>> We already have this, just use "crashkernel=size@0". :)
>>
>
> When it's already there then I don't see the point of the feature at all.
>
> Hardcoding the size doesn't really make any sense to me, especially
> a suspicious one like 128MB.
>

Hi, Andi,

The point here is the size. Hmm... I just got an idea: how about
reserving memory in proportion to the size of the kernel itself? For
example, if the kernel itself is 3M, we reserve 8M (x2 and then rounded
up to 2^n) for the kexec kernel?

Any comments?

Thanks!
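
The sizing rule floated here (twice the kernel size, rounded up to a power of two) is easy to state in code. The following is a minimal sketch; the function names roundup_pow2() and crash_size_from_kernel() are hypothetical and do not come from any posted patch.

```c
#include <assert.h>

/* Smallest power of two >= n, for 0 < n <= 2^63. */
static unsigned long long roundup_pow2(unsigned long long n)
{
    unsigned long long p = 1;

    assert(n > 0 && n <= (1ULL << 63));
    while (p < n)
        p <<= 1;
    return p;
}

/* Hypothetical policy: reserve twice the kernel size, rounded up to 2^n. */
static unsigned long long crash_size_from_kernel(unsigned long long kernel_bytes)
{
    assert(kernel_bytes > 0 && kernel_bytes <= (1ULL << 62));
    return roundup_pow2(2 * kernel_bytes);
}
```

For the 3M kernel in the example, 2 x 3M = 6M, and the next power of two is 8M.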

Amerigo Wang

unread,

Aug 6, 2009, 4:00:15 AM8/6/09

to

Andi Kleen wrote:
> Hardcoding the size doesn't really make any sense to me, especially
> a suspicious one like 128MB.
>

Er, don't worry about this being too large; I am working on a patch which
will allow us to release the unnecessary reserved memory from user space. :)

Eric W. Biederman

unread,

Aug 6, 2009, 4:40:08 AM8/6/09

to

Amerigo Wang <amw...@redhat.com> writes:

> The kernel doesn't have to reserve the exact amount of memory that a kexec
> kernel will use, it just finds a big enough size for all cases which already
> assumes the physical memory is large enough.

>> I think if what you were proposing was part of some coherent story for
>> a complete implementation I would consider it more. Instead this just
>> appears to be a reaction to how frustrating the user space
>> implementation is, and fixing things in the kernel instead of in user
>> space.
>>
>
> Yes, exactly, in fact I am doing another part which will allow us to take back
> of the reserved memory at run-time.

Alright. Let's look at that.

I would make the restriction you can't resize the area while a kexec
on panic image is loaded, and growing the area would not be a
realistic option.

If crash_kernel=auto happens in the context of being able to shrink
the area from user space the definition is simple. We reserve as much
memory as we think we can without affecting performance, stability,
reliability.

We can use an initial approximation of perhaps 1/32nd of low memory
(aka directly mapped memory), and I don't see a point in making the
code arch dependent at all. We should run the size approximation past
the folks on linux-mm as they are more likely to know how much memory
reduction we can tolerate without problems.

We can then plan on user space saying hey that is more than I need:
shrink that, and load the kexec on panic kernel.

Eric

Eric W. Biederman

unread,

Aug 6, 2009, 4:50:07 AM8/6/09

to

Hmm. I half take it back. How is crashkernel=auto and then shrinking
the reserved size better than the extended syntax Bernhard Walle
introduced nearly two years ago?

Amerigo Wang

unread,

Aug 6, 2009, 5:10:11 AM8/6/09

to

Eric W. Biederman wrote:
> Hmm. I half take it back. How is crashkernel=auto and then shrinking
> the reserved size better than the extended syntax Bernhard Walle
> introduced nearly two years ago?
>

You mean something like crashkernel=512M-2G:64M,2G-:128M?
Isn't it longer and more complex than "crashkernel=auto" for an end user?

Hmm, this makes me think that "crashkernel=auto" can be a replacement
for that extended syntax...
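
For comparison, here is a minimal user-space sketch of how a range list such as "512M-2G:64M,2G-:128M" selects a size. It is not the kernel's parser: parse_size() and crash_size_for() are hypothetical names, and the only assumptions are that sizes take a K, M or G suffix and that a range matches when start <= system_ram < end.

```c
#include <assert.h>
#include <stdlib.h>

/* Parse "<number>[K|M|G]" and advance *s past it. */
static unsigned long long parse_size(const char **s)
{
    char *end;
    unsigned long long v = strtoull(*s, &end, 10);

    switch (*end) {
    case 'G': v <<= 10; /* fall through */
    case 'M': v <<= 10; /* fall through */
    case 'K': v <<= 10; end++; break;
    default: break;
    }
    *s = end;
    return v;
}

/*
 * Return the size selected by "start-[end]:size[@offset][,...]" for
 * system_ram bytes, or 0 if no range matches or the string is malformed.
 * The optional "@offset" is parsed and ignored.
 */
static unsigned long long crash_size_for(const char *spec,
                                         unsigned long long system_ram)
{
    assert(spec != NULL);

    while (*spec) {
        unsigned long long start, end = ~0ULL, size;

        start = parse_size(&spec);
        if (*spec++ != '-')
            return 0;
        if (*spec != ':')               /* an open-ended range has no end */
            end = parse_size(&spec);
        if (*spec++ != ':')
            return 0;
        size = parse_size(&spec);
        if (*spec == '@') {
            spec++;
            (void)parse_size(&spec);
        }
        if (system_ram >= start && system_ram < end)
            return size;
        if (*spec == ',')
            spec++;
        else if (*spec != '\0')
            return 0;
    }
    return 0;
}
```

With the string from the message above, a 1G machine gets 64M, an 8G machine gets 128M, and a machine below 512M gets no reservation.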

Amerigo Wang

unread,

Aug 6, 2009, 5:20:14 AM8/6/09

to

Eric W. Biederman wrote:
> Amerigo Wang <amw...@redhat.com> writes:
>

>> Yes, exactly, in fact I am doing another part which will allow us to take back
>> of the reserved memory at run-time.
>>
>
> Alright. Let's look at that.
>
> I would make the restriction you can't resize the area while a kexec
> on panic image is loaded, and growing the area would not be a
> realistic option.
>
>

Sure, I have no plan to grow the reserved memory at run-time... only
to free or shrink it...

> If crash_kernel=auto happens in the context of being able to shrink
> the area from user space the definition is simple. We reserve as much
> memory as we think we can without affecting performance, stability,
> reliability.
>
> We can use an initial approximation of perhaps 1/32nd of low memory
> (aka directly mapped memory), and I don't see a point in making the
> code arch dependent at all. We should run the size approximation past
> the folks on linux-mm as they are more likely to know how much memory
> reduction we can tolerate without problems.
>
>

Yup, agreed.

> We can then plan on user space saying hey that is more than I need:
> shrink that, and load the kexec on panic kernel.
>

Exactly... but the interface still needs to be discussed...

Currently, we have two options:

1) add a new flag to kexec_load(2) to tell the kernel to shrink the memory;
2) use /proc/iomem, let the user decide which and how much of the
reserved memory should be removed.

Any thoughts?

Thanks.

Bernhard Walle

unread,

Aug 7, 2009, 3:20:10 PM8/7/09

to

Amerigo Wang schrieb:

>
> +#ifdef CONFIG_KEXEC_AUTO_RESERVE
> +#ifdef KEXEC_AUTO_RESERVED_SIZE
> +#undef KEXEC_AUTO_RESERVED_SIZE
> +#endif
> +#define KEXEC_AUTO_RESERVED_SIZE (1ULL<<28) /* 256M */
> +#include <asm-generic/kexec.h>
> +#endif
> +
> extern struct kimage *ia64_kimage;

IMO that's way too small for practical use on IA64 systems.

For SLES11, which is based on Linux 2.6.28 IIRC, we use the following
memory size values in the YaST2 kdump module which configures the crashkernel
parameter (this is YCP syntax, but I think everybody understands it):

> // bnc #446480 - Fine-tune kdump memory proposal
> if ((Arch::ia64()) && (total_memory >= 1024))
> {
> integer total_memory_gigabyte = total_memory/1024;
> if ((total_memory_gigabyte >= 1) && (total_memory_gigabyte <12))
> alocated_memory = "256";
> else if ((total_memory_gigabyte >= 12) && (total_memory_gigabyte <128))
> alocated_memory = "512";
> else if ((total_memory_gigabyte >= 128) && (total_memory_gigabyte <256))
> alocated_memory = "768";
> else if ((total_memory_gigabyte >= 256) && (total_memory_gigabyte <378))
> alocated_memory = "1024";
> else if ((total_memory_gigabyte >= 378) && (total_memory_gigabyte <512))
> alocated_memory = "1536";
> else if ((total_memory_gigabyte >= 512) && (total_memory_gigabyte <768))
> alocated_memory = "2048";
> else if (total_memory_gigabyte >= 768)
> alocated_memory = "3072";
> }

I got those assumptions from SGI (and they are known to have large IA64
systems) and I think the values were tested.
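
The tiers in the YaST2 snippet quoted above can also be written as a lookup table, which makes the thresholds easier to audit than a chain of range checks. The following is a minimal sketch in C: the struct and function names are hypothetical, and the values are transcribed from the YCP code (total memory in MB, tiers in GB after integer division).

```c
#include <stddef.h>

struct crash_tier {
    unsigned long long min_gb;  /* inclusive lower bound on total memory, GB */
    unsigned int reserve_mb;    /* crash kernel reservation, MB */
};

/* Tiers transcribed from the quoted YaST2 code, highest first. */
static const struct crash_tier ia64_tiers[] = {
    { 768, 3072 }, { 512, 2048 }, { 378, 1536 }, { 256, 1024 },
    { 128, 768 }, { 12, 512 }, { 1, 256 },
};

/* Return the reservation in MB for total_mb of RAM, or 0 below 1 GB. */
static unsigned int ia64_crash_mb(unsigned long long total_mb)
{
    unsigned long long gb = total_mb / 1024;  /* integer division, as in YCP */
    size_t i;

    for (i = 0; i < sizeof(ia64_tiers) / sizeof(ia64_tiers[0]); i++)
        if (gb >= ia64_tiers[i].min_gb)
            return ia64_tiers[i].reserve_mb;
    return 0;
}
```

Because the table is ordered from the largest tier down, the first entry whose lower bound is met gives the reservation, and the upper bounds from the original code become implicit.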

But IMO it doesn't make sense to put such policy decisions in the
kernel. I see no advantage for that. The average user doesn't have to
write crashkernel parameters, they use the values that the distribution
ships. Or do you think that an average user knows what a UUID of a file
system is just to specify the correct root partition?

Regards,
Bernhard

Bernhard Walle

unread,

Aug 7, 2009, 3:20:14 PM8/7/09

to

Eric W. Biederman schrieb:

> Hmm. I half take it back. How is crashkernel=auto and then shrinking
> the reserved size better than the extended syntax Bernhard Walle
> introduced nearly two years ago?

BTW: Ubuntu ships by default with

crashkernel=384M-2G:64M@16M,2G-:128M@16M

What is the complexity for the user? I didn't edit /boot/grub/menu.lst,
I just installed kexec-tools and a few other kdump-related packages, and
then this was in my menu.lst.

Don't make the kernel complex. Make the userspace complex.

Regards,
Bernhard

Bernhard Walle

unread,

Aug 7, 2009, 3:30:13 PM8/7/09

to

Amerigo Wang schrieb:

>
> +#ifdef CONFIG_KEXEC_AUTO_RESERVE
> +#ifdef KEXEC_AUTO_RESERVED_SIZE
> +#undef KEXEC_AUTO_RESERVED_SIZE
> +#endif
> +#define KEXEC_AUTO_RESERVED_SIZE (1ULL<<28) /* 256M */
> +#include <asm-generic/kexec.h>
> +#endif
> +
> extern struct kimage *ia64_kimage;

IMO that's way too small for practical use on IA64 systems.

Regards,
Bernhard

Bernhard Walle

unread,

Aug 7, 2009, 3:40:09 PM8/7/09

to

Amerigo Wang schrieb:

>
> +#ifdef CONFIG_KEXEC_AUTO_RESERVE
> +#ifdef KEXEC_AUTO_RESERVED_SIZE
> +#undef KEXEC_AUTO_RESERVED_SIZE
> +#endif
> +#define KEXEC_AUTO_RESERVED_SIZE (1ULL<<28) /* 256M */
> +#include <asm-generic/kexec.h>
> +#endif
> +
> extern struct kimage *ia64_kimage;

IMO that's way too small for practical use on IA64 systems.

Regards,
Bernhard

--

Eric W. Biederman

unread,

Aug 7, 2009, 4:00:21 PM8/7/09

to

Let me put this concrete proposal on the table.

The problem:

With the current set of crashkernel= options we are asking the
distribution installer to perform magic. Moving as much of this logic
into a normal init script for better maintenance is desirable.

My proposal:

Implement crashkernel=max which reserves as much memory as is
reasonable for a crash kernel, without seriously affecting stability,
performance, and reliability.

As an initial approximation I would use a 32nd of low memory.

In addition implement:

/sys/kernel/crash_size

That can be written to (with enough privileges, when no crash kernel is
loaded) to reduce the amount of memory reserved for the crash kernel.

Bernhard does that sound useful to you?

Amerigo does that seem reasonable?

Eric
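
From user space, shrinking the reservation through such a file would amount to writing a decimal byte count to it. The following is a minimal sketch assuming the file accepts a plain decimal number; the helper name write_crash_size() is hypothetical, and the path is left as a parameter because the file name is only a proposal at this point in the thread.

```c
#include <stdio.h>

/*
 * Write new_size (in bytes) as a decimal string to a sysfs-style file.
 * Returns 0 on success, -1 on failure.
 */
static int write_crash_size(const char *path, unsigned long long new_size)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (f == NULL)
        return -1;
    ok = fprintf(f, "%llu\n", new_size) > 0;
    /* fclose() flushes the buffer, so a rejected write may surface here. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

An init script could do the same with a shell redirect; the C form is shown only to make the error paths explicit, since the proposal has the kernel refuse the write while a crash kernel is loaded.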

Andi Kleen

unread,

Aug 7, 2009, 5:10:15 PM8/7/09

to

> As an initial approximation I would use a 32nd of low memory.

That means a 1TB machine will have a 32GB crash kernel.

Surely that's excessive?!?

It would be repeating all the same mistakes people made with hash tables
several years ago.

>
> That can be written to (with enough privileges when no crash kernel is
> loaded) reduce the amount of memory reserved by the crash kernel.
>
> Bernhard does that sound useful to you?
>
> Amerigo does that seem reasonable?

It doesn't sound reasonable to Andi.

Why do you even want to grow the crash kernel that much? Is there
any real problem with a 64-128MB crash kernel?

-Andi
>

--
a...@linux.intel.com -- Speaking for myself only.

Bernhard Walle

unread,

Aug 7, 2009, 5:30:10 PM8/7/09

to

Andi Kleen schrieb:

>> As an initial approximation I would use a 32nd of low memory.
>
> That means a 1TB machine will have a 32GB crash kernel.
>
> Surely that's excessive?!?
>
> It would be repeating all the same mistakes people made with hash tables
> several years ago.

The idea of Eric was to shrink the reserved memory in an init script. I
doubt that the 1 TB machine will have any problems or performance issue
when booting with (1 TB - 32 GB) memory.

> It doesn't sound reasonable to Andi.
>
> Why do you even want to grow the crash kernel that much? Is there
> any real problem with a 64-128MB crash kernel?

Try it out. No chance for 64-128MB crashkernel on "medium" IA64 machines.

Regards,
Bernhard

Bernhard Walle

unread,

Aug 7, 2009, 5:40:07 PM8/7/09

to

Eric W. Biederman schrieb:

>
> With the current set of crashkernel= options we are asking the
> distribution installer to perform magic. Moving as much of this logic
> into a normal init script for better maintenance is desirable.

Not (necessarily) the installer but the program that configures kdump.
system-config-kdump on Red Hat, YaST on SUSE.

> Bernhard does that sound useful to you?

I don't see any problems. I don't know how much effort is it to free
already reserved crashkernel memory, but I guess it's not really
complicated. Maybe that "1/32" should be specified on the command line like

crashkernel=>>5

(for 1/32*system_memory == system_memory>>5), OTOH I have no real strong
opinion.

Regards,
Bernhard
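
The suggested "crashkernel=>>N" form would reserve system_memory >> N bytes. The following is a minimal sketch of parsing the value part, assuming the syntax is exactly ">>" followed by a decimal shift count; crash_size_from_shift() is a hypothetical name, not an existing kernel function.

```c
#include <stdlib.h>

/*
 * Parse the proposed ">>N" value and return system_ram >> N,
 * or 0 if the value is not of that form or N is out of range.
 */
static unsigned long long crash_size_from_shift(const char *val,
                                                unsigned long long system_ram)
{
    char *end;
    unsigned long n;

    if (val == NULL || val[0] != '>' || val[1] != '>')
        return 0;
    if (val[2] < '0' || val[2] > '9')
        return 0;
    n = strtoul(val + 2, &end, 10);
    if (*end != '\0' || n >= 64)
        return 0;
    return system_ram >> n;
}
```

With ">>5", an 8G machine would reserve 256M and a 1T machine 32G, which is the figure discussed earlier in the thread.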

Eric W. Biederman

unread,

Aug 7, 2009, 6:10:09 PM8/7/09

to

Andi Kleen <an...@firstfloor.org> writes:

>> As an initial approximation I would use a 32nd of low memory.
>
> That means a 1TB machine will have a 32GB crash kernel.
>
> Surely that's excessive?!?
>
> It would be repeating all the same mistakes people made with hash tables
> several years ago.
>
>>
>> That can be written to (with enough privileges when no crash kernel is
>> loaded) reduce the amount of memory reserved by the crash kernel.
>>
>> Bernhard does that sound useful to you?
>>
>> Amerigo does that seem reasonable?
>
> It doesn't sound reasonable to Andi.
>
> Why do you even want to grow the crash kernel that much? Is there
> any real problem with a 64-128MB crash kernel?

Because it is absolutely ridiculous in size and user space will have
to take up the work of trimming back down to something reasonable in
the init script.

At a practical level crash dump userlands do things like fsck
filesystems before they mount them. For truly large machines there
was a desire to parallelize core dump writing to different disks. I
don't know if that has been implemented yet, but in that case more RAM
for buffers certainly tends to be useful.

I think if we are going to go beyond having a magic boot command
line (that we have today) that parametrizes the amount of memory
to reserve based on how much memory we have in the system, we need
to put user space in control. We can only put user space in control
if we initially reserve too much and let it release the memory it
won't use.

That would allow removing magic from installers and leaving it to
installed packages. Which seems a lot more maintainable.

Eric

Eric W. Biederman

unread,

Aug 7, 2009, 6:20:08 PM8/7/09

to

Bernhard Walle <bernhar...@gmx.de> writes:

> Eric W. Biederman schrieb:
>>
>> With the current set of crashkernel= options we are asking the
>> distribution installer to perform magic. Moving as much of this logic
>> into a normal init script for better maintenance is desirable.
>
> Not (necessarily) the installer but the program that configures kdump.
> system-config-kdump on Red Hat, YaST on SUSE.

Right. Somehow I thought YaST was the installer; my mistake.

>> Bernhard does that sound useful to you?
>
> I don't see any problems. I don't know how much effort is it to free
> already reserved crashkernel memory, but I guess it's not really
> complicated.

Right.

> Maybe that "1/32" should be specified on the command line like
>
>
> crashkernel=>>5
>
> (for 1/32*system_memory == system_memory>>5), OTOH I have no real strong
> opinion.

The idea is for the system to give us as much as it can stand and
userspace gives the rest back. The maximum memory any particular
kernel can stand to give up is a tractable kernel level problem, and
we can make it autotune like any other kernel tunable. What a crash
kernel needs totally depends on the implementation.

Eric

Amerigo Wang

unread,

Aug 9, 2009, 11:10:07 PM8/9/09

to

Hmm, thanks for this.

> But IMO it doesn't make sense to put such policy decisions in the
> kernel. I see no advantage for that. The average user doesn't have to
> write crashkernel parameters, they use the values that the distribution
> ships. Or do you think that an average user knows what a UUID of a file
> system is just to specify the correct root partition?
>

The advantage is that we can provide a clever policy which can't be
implemented with the current mechanism, e.g. a 32nd of physical memory,
as proposed by Eric.

Amerigo Wang

unread,

Aug 9, 2009, 11:20:11 PM8/9/09

to

Eric W. Biederman wrote:
> Let me put this concrete proposal on the table.
>
> The problem:
>
> With the current set of crashkernel= options we are asking the
> distribution installer to perform magic. Moving as much of this logic
> into a normal init script for better maintenance is desirable.
>
> My proposal:
>
> Implement crashkernel=max which reserves as much memory as is
> reasonable for a crash kernel, without seriously affecting stability,
> performance, and reliability.
>

This is almost exactly what I want with crashkernel=auto....
So there's no big difference, except the name.

> As an initial approximation I would use a 32nd of low memory.
>

Hmm, I think Bernhard's proposal is fine for this case, i.e. we can
introduce a new syntax, "crashkernel=>>X" which means we reserve 1/2^X
of system memory.

What do you think?

> In addition implement:
>
> /sys/kernel/crash_size
>
> That can be written to (with enough privileges when no crash kernel is
> loaded) reduce the amount of memory reserved by the crash kernel.
>

Yeah, this is nice!

Thanks.

Amerigo Wang

unread,

Aug 12, 2009, 4:20:04 AM8/12/09

to

Since patch 2/8 already implements the generic part, this patch adds
the remaining part for ia64.

Signed-off-by: WANG Cong <amw...@redhat.com>

---

Index: linux-2.6/arch/ia64/include/asm/kexec.h
===================================================================
--- linux-2.6.orig/arch/ia64/include/asm/kexec.h
+++ linux-2.6/arch/ia64/include/asm/kexec.h

@@ -19,6 +19,30 @@
flush_icache_range(page_addr, page_addr + PAGE_SIZE); \
} while(0)

+#ifdef CONFIG_KEXEC_AUTO_RESERVE
+#define ARCH_HAS_DEFAULT_CRASH_SIZE
+static inline
+unsigned long long arch_default_crash_size(unsigned long long total_size)
+{
+ if (total_size >= 4ULL<<30 && total_size < 12ULL<<30)
+ return 1ULL<<28;
+ else if (total_size >= 12ULL<<30 && total_size < 128ULL<<30)
+ return 1ULL<<29;
+ else if (total_size >= 128ULL<<30 && total_size < 256ULL<<30)
+ return 3ULL<<28;
+ else if (total_size >= 256ULL<<30 && total_size < 378ULL<<30)
+ return 1ULL<<30;
+ else if (total_size >= 378ULL<<30 && total_size < 512ULL<<30)
+ return 3ULL<<29;
+ else if (total_size >= 512ULL<<30 && total_size < 768ULL<<30)
+ return 2ULL<<30;
+ else if (total_size >= 768ULL<<30)
+ return 3ULL<<30;
+ return 0; /* below the 4G threshold: reserve nothing */
+}
+#include <asm-generic/kexec.h>
+#endif
+
extern struct kimage *ia64_kimage;

extern const unsigned int relocate_new_kernel_size;
extern void relocate_new_kernel(unsigned long, unsigned long,

Amerigo Wang

unread,

Aug 12, 2009, 4:20:08 AM8/12/09

to

Introduce a new config option KEXEC_AUTO_RESERVE for powerpc.

Signed-off-by: WANG Cong <amw...@redhat.com>
Acked-by: Neil Horman <nho...@tuxdriver.com>

---

Index: linux-2.6/arch/powerpc/Kconfig
===================================================================
--- linux-2.6.orig/arch/powerpc/Kconfig
+++ linux-2.6/arch/powerpc/Kconfig
@@ -346,6 +346,17 @@ config KEXEC
support. As of this writing the exact hardware interface is
strongly in flux, so no good recommendation can be made.

+config KEXEC_AUTO_RESERVE
+ bool "automatically reserve memory for kexec kernel"
+ depends on KEXEC
+ default y
+ ---help---
+ Automatically reserve memory for a kexec kernel, so that you don't
+ need to specify numbers for the "crashkernel=X@Y" boot option,
+ instead you can use "crashkernel=auto". To make this work, you need
+ to have more than 4G memory. On PPC 256M is reserved, 1/32 memory
+ on PPC64, but it will not exceed 1T/32.
+
config CRASH_DUMP
bool "Build a kdump crash kernel"
depends on PPC64 || 6xx

Amerigo Wang

unread,

Aug 12, 2009, 4:20:09 AM8/12/09

to

Implement "crashkernel=auto" for x86 first, other arch will be added in the
following patches.

Signed-off-by: WANG Cong <amw...@redhat.com>

---

Index: linux-2.6/kernel/kexec.c
===================================================================
--- linux-2.6.orig/kernel/kexec.c
+++ linux-2.6/kernel/kexec.c
@@ -37,6 +37,7 @@
#include <asm/io.h>
#include <asm/system.h>
#include <asm/sections.h>
+#include <asm/setup.h>

/* Per cpu memory for storing cpu states in case of system crash. */
note_buf_t* crash_notes;
@@ -1297,6 +1298,39 @@ int __init parse_crashkernel(char *cm

ck_cmdline += 12; /* strlen("crashkernel=") */

+#ifdef CONFIG_KEXEC_AUTO_RESERVE
+ if (strncmp(ck_cmdline, "auto", 4) == 0) {
+ unsigned long long size;
+ int len;
+ char tmp[32];
+
+ size = arch_default_crash_size(system_ram);
+ if (size != 0) {
+ *crash_size = size;
+ *crash_base = arch_default_crash_base();
+ len = scnprintf(tmp, sizeof(tmp), "%luM@%luM",
+ (unsigned long)(*crash_size)>>20,
+ (unsigned long)(*crash_base)>>20);
+ /* 'len' can't be <= 4. */
+ if (likely((len - 4 + strlen(cmdline))
+ < COMMAND_LINE_SIZE - 1)) {
+ memmove(ck_cmdline + len, ck_cmdline + 4,
+ strlen(cmdline) - (ck_cmdline + 4 - cmdline) + 1);
+ memcpy(ck_cmdline, tmp, len);
+ }
+ return 0;
+ } else {
+ /*
+ * We can't reserve memory automatically,
+ * remove "crashkernel=auto" from cmdline.
+ */
+ ck_cmdline += 4; /* strlen("auto") */
+ memmove(ck_cmdline - 16, ck_cmdline,
+ strlen(cmdline) - (ck_cmdline - cmdline) + 1);
+ return -ENOMEM;
+ }
+ }
+#endif
/*
* if the commandline contains a ':', then that's the extended
* syntax -- if not, it must be the classic syntax
Index: linux-2.6/arch/x86/include/asm/kexec.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/kexec.h
+++ linux-2.6/arch/x86/include/asm/kexec.h
@@ -23,6 +23,7 @@

#include <asm/page.h>
#include <asm/ptrace.h>
+#include <asm-generic/kexec.h>

/*
* KEXEC_SOURCE_MEMORY_LIMIT maximum page get_free_page can return.
Index: linux-2.6/include/asm-generic/kexec.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-generic/kexec.h
@@ -0,0 +1,42 @@
+#ifndef _ASM_GENERIC_KEXEC_H
+#define _ASM_GENERIC_KEXEC_H
+
+#ifdef CONFIG_KEXEC_AUTO_RESERVE
+
+#ifndef KEXEC_AUTO_RESERVED_SIZE
+#define KEXEC_AUTO_RESERVED_SIZE (1ULL<<27) /* 128M */
+#endif
+#ifndef KEXEC_AUTO_THRESHOLD
+#define KEXEC_AUTO_THRESHOLD (1ULL<<32) /* 4G */
+#endif
+
+#ifndef ARCH_HAS_DEFAULT_CRASH_SIZE
+static inline
+unsigned long long arch_default_crash_size(unsigned long long total_size)
+{
+ if (total_size < KEXEC_AUTO_THRESHOLD)
+ return 0;
+ else {
+#ifdef CONFIG_64BIT
+ if (total_size > 1ULL<<40) /* 1TB */
+ return KEXEC_AUTO_RESERVED_SIZE
+ * ((1ULL<<40) / KEXEC_AUTO_THRESHOLD);
+ return 1ULL<<ilog2(roundup(total_size/32, 1ULL<<21));
+#else
+ return KEXEC_AUTO_RESERVED_SIZE;
+#endif
+ }
+}
+#endif
+#ifndef ARCH_HAS_DEFAULT_CRASH_BASE
+static inline
+unsigned long long arch_default_crash_base(void)
+{
+ /* 0 means find the base address automatically. */
+ return 0;
+}
+#endif
+
+#endif /* CONFIG_KEXEC_AUTO_RESERVE */
+
+#endif
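
For readers following the arithmetic, the generic sizing policy of arch_default_crash_size() can be modelled in user space roughly as below. This is a sketch under stated assumptions, not the patch itself: default_crash_size(), ilog2_u64() and roundup_u64() are stand-in names invented for illustration, and only the 64-bit branch is modelled. Since `/` binds tighter than `<<` in C, the 1TB cap is written with explicit parentheses here.

```c
#include <stdint.h>

#define AUTO_RESERVED_SIZE (1ULL << 27) /* 128M, as in the patch */
#define AUTO_THRESHOLD     (1ULL << 32) /* 4G, as in the patch */

/* Stand-in for the kernel's ilog2(): index of the highest set bit. */
static unsigned int ilog2_u64(uint64_t v)
{
	unsigned int n = 0;

	while (v >>= 1)
		n++;
	return n;
}

/* Stand-in for the kernel's roundup(): next multiple of 'step'. */
static uint64_t roundup_u64(uint64_t v, uint64_t step)
{
	return ((v + step - 1) / step) * step;
}

/* Hypothetical user-space model of the 64-bit sizing policy. */
static uint64_t default_crash_size(uint64_t total_size)
{
	/* Below the threshold nothing is reserved automatically. */
	if (total_size < AUTO_THRESHOLD)
		return 0;
	/* Above 1TB the reservation is capped at 1TB/32. */
	if (total_size > (1ULL << 40))
		return AUTO_RESERVED_SIZE * ((1ULL << 40) / AUTO_THRESHOLD);
	/* Otherwise: 1/32 of RAM, rounded down to a power of two. */
	return 1ULL << ilog2_u64(roundup_u64(total_size / 32, 1ULL << 21));
}
```

Under these assumptions a 12G machine ends up with 256M rather than 384M, because the result is rounded down to a power of two.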

Amerigo Wang

Aug 12, 2009, 4:20:08 AM

This patch implements shrinking the reserved memory for crash kernel,
if it is more than enough.

For example, if you have already reserved 128M, now you just want 100M,
you can do:

# echo $((100*1024*1024)) > /sys/kernel/kexec_crash_size

Signed-off-by: WANG Cong <amw...@redhat.com>
Cc: Neil Horman <nho...@redhat.com>
Cc: Eric W. Biederman <ebie...@xmission.com>
Cc: Andi Kleen <an...@firstfloor.org>

---

Index: linux-2.6/include/linux/kexec.h
===================================================================
--- linux-2.6.orig/include/linux/kexec.h
+++ linux-2.6/include/linux/kexec.h
@@ -206,6 +206,9 @@ extern size_t vmcoreinfo_max_size;

int __init parse_crashkernel(char *cmdline, unsigned long long system_ram,
unsigned long long *crash_size, unsigned long long *crash_base);
+int shrink_crash_memory(unsigned long new_size);
+int kexec_crash_kernel_loaded(void);
+size_t get_crash_memory_size(void);

#else /* !CONFIG_KEXEC */
struct pt_regs;

Index: linux-2.6/kernel/kexec.c
===================================================================
--- linux-2.6.orig/kernel/kexec.c
+++ linux-2.6/kernel/kexec.c

@@ -1083,6 +1083,76 @@ void crash_kexec(struct pt_regs *regs)
}
}

+int kexec_crash_kernel_loaded(void)
+{
+ int ret;
+ if (!mutex_trylock(&kexec_mutex))
+ return 1;
+ ret = kexec_crash_image != NULL;
+ mutex_unlock(&kexec_mutex);
+ return ret;
+}
+
+size_t get_crash_memory_size(void)
+{
+ size_t size;
+ if (!mutex_trylock(&kexec_mutex))
+ return 1;
+ size = crashk_res.end - crashk_res.start + 1;
+ mutex_unlock(&kexec_mutex);
+ return size;
+}
+
+int shrink_crash_memory(unsigned long new_size)
+{
+ struct page **pages;
+ int ret = 0;
+ int npages, i;
+ unsigned long addr;
+ unsigned long start, end;
+ void *vaddr;
+
+ if (!mutex_trylock(&kexec_mutex))
+ return -EBUSY;
+
+ start = crashk_res.start;
+ end = crashk_res.end;
+
+ if (new_size >= end - start + 1) {
+ ret = -EINVAL;
+ if (new_size == end - start + 1)
+ ret = 0;
+ goto unlock;
+ }
+
+ start = roundup(start, PAGE_SIZE);
+ end = roundup(start + new_size, PAGE_SIZE) - 1;
+ npages = (end + 1 - start ) / PAGE_SIZE;
+
+ pages = kmalloc(sizeof(struct page *) * npages, GFP_KERNEL);
+ if (!pages) {
+ ret = -ENOMEM;
+ goto unlock;
+ }
+ for (i = 0; i < npages; i++) {
+ addr = end + 1 + i * PAGE_SIZE;
+ pages[i] = virt_to_page(addr);
+ }
+
+ vaddr = vm_map_ram(pages, npages, 0, PAGE_KERNEL);
+ if (!vaddr) {
+ ret = -ENOMEM;
+ goto free;
+ }
+ crashk_res.end = end;
+
+free:
+ kfree(pages);
+unlock:
+ mutex_unlock(&kexec_mutex);
+ return ret;
+}
+
static u32 *append_elf_note(u32 *buf, char *name, unsigned type, void *data,
size_t data_len)
{
Index: linux-2.6/kernel/ksysfs.c
===================================================================
--- linux-2.6.orig/kernel/ksysfs.c
+++ linux-2.6/kernel/ksysfs.c
@@ -100,6 +100,26 @@ static ssize_t kexec_crash_loaded_show(s
}
KERNEL_ATTR_RO(kexec_crash_loaded);

+static ssize_t kexec_crash_size_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%zu\n", get_crash_memory_size());
+}
+static ssize_t kexec_crash_size_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned long cnt;
+ int ret;
+
+ if (kexec_crash_kernel_loaded())
+ return -ENOENT;
+ cnt = simple_strtoul(buf, NULL, 10);
+ ret = shrink_crash_memory(cnt);
+ return ret < 0 ? ret: count;
+}
+KERNEL_ATTR_RW(kexec_crash_size);
+
static ssize_t vmcoreinfo_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
@@ -147,6 +167,7 @@ static struct attribute * kernel_attrs[]
#ifdef CONFIG_KEXEC
&kexec_loaded_attr.attr,
&kexec_crash_loaded_attr.attr,
+ &kexec_crash_size_attr.attr,
&vmcoreinfo_attr.attr,
#endif
NULL
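
To make the intended arithmetic of the shrink concrete: the reservation keeps its start, the new end is page-aligned, and whatever lies between the new end and the old end is the part to hand back. The user-space sketch below models only that range calculation. struct range, shrink_range() and the fixed 4K page size are assumptions made for illustration, and nothing here touches kernel state.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL /* assumed page size */

struct range {
	uint64_t start; /* first byte */
	uint64_t end;   /* last byte, inclusive, like struct resource */
};

static uint64_t round_up_page(uint64_t v)
{
	return (v + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/*
 * Shrink 'res' to 'new_size' bytes. On success 'freed' describes the
 * tail that is no longer reserved (freed->end < freed->start if empty).
 * Growing is rejected with -EINVAL, as in the patch. A zero size is
 * also rejected to keep the sketch simple (an assumption, not taken
 * from the patch).
 */
static int shrink_range(struct range *res, uint64_t new_size,
			struct range *freed)
{
	uint64_t old_size = res->end - res->start + 1;
	uint64_t start, new_end;

	if (new_size == 0 || new_size > old_size)
		return -EINVAL;

	start = round_up_page(res->start);
	new_end = round_up_page(start + new_size) - 1;
	if (new_end >= res->end) {
		/* Nothing left to give back after alignment. */
		freed->start = 1;
		freed->end = 0;
		return 0;
	}
	freed->start = new_end + 1;
	freed->end = res->end;
	res->end = new_end;
	return 0;
}
```

With a 128M reservation starting at 16M, asking for 100M leaves a 28M tail in 'freed'.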

Amerigo Wang

Aug 12, 2009, 4:20:10 AM

Introduce a new config option KEXEC_AUTO_RESERVE for ia64.

Signed-off-by: WANG Cong <amw...@redhat.com>
Acked-by: Neil Horman <nho...@tuxdriver.com>

---

Index: linux-2.6/arch/ia64/Kconfig
===================================================================
--- linux-2.6.orig/arch/ia64/Kconfig
+++ linux-2.6/arch/ia64/Kconfig
@@ -582,6 +582,20 @@ config KEXEC

support. As of this writing the exact hardware interface is
strongly in flux, so no good recommendation can be made.

+config KEXEC_AUTO_RESERVE
+ bool "automatically reserve memory for kexec kernel"
+ depends on KEXEC
+ default y
+ ---help---
+ Automatically reserve memory for a kexec kernel, so that you don't
+ need to specify numbers for the "crashkernel=X@Y" boot option,
+ instead you can use "crashkernel=auto". To make this work, you need
+ to have at least 4G of memory.
+
+ The reserved memory size depends on how much memory you
+ actually have. Please check Documentation/kdump/kdump.txt.
+ If in doubt, say N.
+
config CRASH_DUMP
bool "kernel crash dumps"
depends on IA64_MCA_RECOVERY && !IA64_HP_SIM && (!SMP || HOTPLUG_CPU)

Amerigo Wang

Aug 12, 2009, 4:20:14 AM

Update the kdump documentation.

Signed-off-by: WANG Cong <amw...@redhat.com>

---

Index: linux-2.6/Documentation/kdump/kdump.txt
===================================================================
--- linux-2.6.orig/Documentation/kdump/kdump.txt
+++ linux-2.6/Documentation/kdump/kdump.txt
@@ -147,6 +147,15 @@ System kernel config options
analysis tools require a vmlinux with debug symbols in order to read
and analyze a dump file.

+4) Enable "automatically reserve memory for kexec kernel" in
+ "Processor type and features."
+
+ CONFIG_KEXEC_AUTO_RESERVE=y
+
+ This will let you use "crashkernel=auto", instead of specifying
+ numbers for "crashkernel=". Note, you need to have enough memory.
+ The threshold and reserved memory size are arch-dependent.
+
Dump-capture kernel config options (Arch Independent)
-----------------------------------------------------

@@ -266,6 +275,25 @@ This would mean:
2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
3) if the RAM size is larger than 2G, then reserve 128M

+Or you can use:
+
+ crashkernel=auto
+
+if you have enough memory. The threshold is 4G, below which this won't work.
+
+The automatically reserved memory size would be 128M on x86_32, 256M on
+ppc, 1/32 of your physical memory size on x86_64 and ppc64 (but it will not
+exceed 1TB/32 if you have more). IA64 has its own policy, shown below:
+
+ Memory size Reserved memory
+ =========== ===============
+ [4G, 12G) 256M
+ [12G, 128G) 512M
+ [128G, 256G) 768M
+ [256G, 378G) 1024M
+ [378G, 512G) 1536M
+ [512G, 768G) 2048M
+ [768G, ) 3072M

Boot into System Kernel
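
The IA64 table in the documentation hunk above amounts to a simple range lookup. A minimal sketch, assuming only the boundaries listed there; the function name ia64_auto_crash_size_mb() and the GiB() helper are made up for illustration, and the real arch code may be structured differently:

```c
#include <stdint.h>

#define GiB(x) ((uint64_t)(x) << 30)

/* Returns the reserved size in megabytes, or 0 below the 4G threshold. */
static unsigned int ia64_auto_crash_size_mb(uint64_t total)
{
	if (total < GiB(4))
		return 0;
	if (total < GiB(12))
		return 256;
	if (total < GiB(128))
		return 512;
	if (total < GiB(256))
		return 768;
	if (total < GiB(378))
		return 1024;
	if (total < GiB(512))
		return 1536;
	if (total < GiB(768))
		return 2048;
	return 3072;
}
```

Each lower bound is inclusive and each upper bound exclusive, matching the [a, b) notation of the table.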

Eric W. Biederman

Aug 12, 2009, 11:20:11 PM

Amerigo Wang <amw...@redhat.com> writes:

> This patch implements shrinking the reserved memory for crash kernel,
> if it is more than enough.
>
> For example, if you have already reserved 128M, now you just want 100M,
> you can do:
>
> # echo $((100*1024*1024)) > /sys/kernel/kexec_crash_size

Getting closer (comments inline)

Semantically this patch is uncontroversial and pretty
simple, but it still needs a fair amount of review. Can
you put this patch at the front of your patch set?

We don't need trylock on this code path

> + ret = kexec_crash_image != NULL;
> + mutex_unlock(&kexec_mutex);
> + return ret;
> +}
> +
> +size_t get_crash_memory_size(void)
> +{
> + size_t size;
> + if (!mutex_trylock(&kexec_mutex))
> + return 1;

We don't need trylock on this code path

> + size = crashk_res.end - crashk_res.start + 1;
> + mutex_unlock(&kexec_mutex);
> + return size;
> +}
> +
> +int shrink_crash_memory(unsigned long new_size)
> +{
> + struct page **pages;
> + int ret = 0;
> + int npages, i;
> + unsigned long addr;
> + unsigned long start, end;
> + void *vaddr;
> +
> + if (!mutex_trylock(&kexec_mutex))
> + return -EBUSY;

We don't need trylock on this code path

We are missing the check to see if the crash_kernel is loaded
under this lock instance. So please move the kexec_crash_image != NULL
test inline here and kill the kexec_crash_kernel_loaded function.

This is the wrong kernel call to use. I expect this needs to look
like a memory hotplug event. This does not put the pages into the
free page pool.
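
The locking point above can be illustrated outside the kernel. The sketch below uses a pthread mutex as a stand-in for kexec_mutex and tests the "image loaded" state inside the same critical section that performs the shrink, so the check and the update cannot race. All names here (struct crash_state, shrink_locked()) are hypothetical, and a blocking lock is used instead of trylock, as suggested in the review.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

struct crash_state {
	pthread_mutex_t lock; /* stand-in for kexec_mutex */
	const void *image;    /* stand-in for kexec_crash_image */
	unsigned long size;   /* currently reserved bytes */
};

static int shrink_locked(struct crash_state *st, unsigned long new_size)
{
	int ret = 0;

	/* Sleep until the lock is free rather than failing with -EBUSY. */
	pthread_mutex_lock(&st->lock);

	/* Checked under the same lock that protects the update below. */
	if (st->image != NULL) {
		ret = -ENOENT;
		goto unlock;
	}
	if (new_size > st->size) {
		ret = -EINVAL;
		goto unlock;
	}
	st->size = new_size;

unlock:
	pthread_mutex_unlock(&st->lock);
	return ret;
}
```

Once an image is recorded in the state, every later shrink attempt fails without changing the size, whichever thread gets there first.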

Amerigo Wang

Aug 12, 2009, 11:40:08 PM

Eric W. Biederman wrote:
> Amerigo Wang <amw...@redhat.com> writes:
>
>
>> This patch implements shrinking the reserved memory for crash kernel,
>> if it is more than enough.
>>
>> For example, if you have already reserved 128M, now you just want 100M,
>> you can do:
>>
>> # echo $((100*1024*1024)) > /sys/kernel/kexec_crash_size
>>
>
> Getting closer (comments inline)
>
> Semantically this patch is uncontroversial and pretty
> simple, but it still needs a fair amount of review. Can
> you put this patch at the front of your patch set?
>
>

Sure, I will do it when I resend them next time.

I have added the mm people to Cc.

>> Index: linux-2.6/kernel/kexec.c
>> ===================================================================
>> --- linux-2.6.orig/kernel/kexec.c
>> +++ linux-2.6/kernel/kexec.c
>> @@ -1083,6 +1083,76 @@ void crash_kexec(struct pt_regs *regs)
>> }
>> }
>>
>> +int kexec_crash_kernel_loaded(void)
>> +{
>> + int ret;
>> + if (!mutex_trylock(&kexec_mutex))
>> + return 1;
>>
>
> We don't need trylock on this code path
>

OK.

>
>> + ret = kexec_crash_image != NULL;
>> + mutex_unlock(&kexec_mutex);
>> + return ret;
>> +}
>> +
>> +size_t get_crash_memory_size(void)
>> +{
>> + size_t size;
>> + if (!mutex_trylock(&kexec_mutex))
>> + return 1;
>>
>
> We don't need trylock on this code path
>
>

Hmm, crashk_res is a global struct, so other processes can also
change it... but currently no process does that, right?

>> + size = crashk_res.end - crashk_res.start + 1;
>> + mutex_unlock(&kexec_mutex);
>> + return size;
>> +}
>> +
>> +int shrink_crash_memory(unsigned long new_size)
>> +{
>> + struct page **pages;
>> + int ret = 0;
>> + int npages, i;
>> + unsigned long addr;
>> + unsigned long start, end;
>> + void *vaddr;
>> +
>> + if (!mutex_trylock(&kexec_mutex))
>> + return -EBUSY;
>>
>
> We don't need trylock on this code path
>
> We are missing the check to see if the crash_kernel is loaded
> under this lock instance. So please move the kexec_crash_image != NULL
> test inline here and kill the kexec_crash_kernel_loaded function.
>

Ok, no problem.

Well, I also wanted to use a memory-hotplug API, but that will make the
code depend on memory-hotplug, which certainly is not what we want...

I checked the mm code, actually what I need is an API which is similar
to add_active_range(), but add_active_range() can't be used here since
it is marked as "__init".

Do we have that kind of API in mm? I can't find one.

Thanks!

Eric W. Biederman

Aug 13, 2009, 2:20:09 AM

Amerigo Wang <amw...@redhat.com> writes:

We still need the lock. Just doing trylock instead
of just sleeping doesn't seem to make any sense on these
code paths.

Perhaps we will need to remove __init from add_active_range. I know the logic
but I'm not up to speed on the mm pieces at the moment.

Eric

Amerigo Wang

Aug 13, 2009, 4:30:14 AM

Eric W. Biederman wrote:
> Amerigo Wang <amw...@redhat.com> writes:
>
>
>>>
>>>
>>>> + ret = kexec_crash_image != NULL;
>>>> + mutex_unlock(&kexec_mutex);
>>>> + return ret;
>>>> +}
>>>> +
>>>> +size_t get_crash_memory_size(void)
>>>> +{
>>>> + size_t size;
>>>> + if (!mutex_trylock(&kexec_mutex))
>>>> + return 1;
>>>>
>>>>
>>> We don't need trylock on this code path
>>>
>>>
>>>
>> Hmm, crashk_res is a global struct, so other process can also
>> change it... but currently no process does that, right?
>>
>>
>
> We still need the lock. Just doing trylock instead
> of just sleeping doesn't seem to make any sense on these
> code paths.
>
>

Ok, got it.

Not that simple, marking it as "__init" means it uses some "__init" data
which will be dropped after initialization.

Thanks.

Eric W. Biederman

Aug 14, 2009, 6:20:10 PM

Amerigo Wang <amw...@redhat.com> writes:

> Not that simple, marking it as "__init" means it uses some "__init" data which
> will be dropped after initialization.

If we start with the assumption that we will be reserving too much and
will free the memory once we know how much we really need I see a very
simple way to go about this. We ensure that the reservation of crash
kernel memory is done through a normal allocation so that we have
struct page entries for every page. On 32bit x86 that is an extra 1MB
for a 128MB allocation.

Then when it comes time to release that memory we clear whatever magic
flags we have on the page (like PG_reserve) and call free_page.

Eric

Amerigo Wang

Aug 17, 2009, 6:00:17 AM

Eric W. Biederman wrote:
> Amerigo Wang <amw...@redhat.com> writes:
>
>
>> Not that simple, marking it as "__init" means it uses some "__init" data which
>> will be dropped after initialization.
>>
>
> If we start with the assumption that we will be reserving too much and
> will free the memory once we know how much we really need I see a very
> simple way to go about this. We ensure that the reservation of crash
> kernel memory is done through a normal allocation so that we have
> struct page entries for every page. On 32bit x86 that is an extra 1MB
> for a 128MB allocation.
>
> Then when it comes time to release that memory we clear whatever magic
> flags we have on the page (like PG_reserve) and call free_page.
>

Hmm, my MM knowledge is not good enough to judge if this works...
I need to check more MM source code.

Can any MM people help?

Thanks.

KAMEZAWA Hiroyuki

Aug 17, 2009, 8:40:08 PM

On Mon, 17 Aug 2009 17:50:21 +0800
Amerigo Wang <amw...@redhat.com> wrote:

> Eric W. Biederman wrote:
> > Amerigo Wang <amw...@redhat.com> writes:
> >
> >
> >> Not that simple, marking it as "__init" means it uses some "__init" data which
> >> will be dropped after initialization.
> >>
> >
> > If we start with the assumption that we will be reserving too much and
> > will free the memory once we know how much we really need I see a very
> > simple way to go about this. We ensure that the reservation of crash
> > kernel memory is done through a normal allocation so that we have
> > struct page entries for every page. On 32bit x86 that is an extra 1MB
> > for a 128MB allocation.
> >
> > Then when it comes time to release that memory we clear whatever magic
> > flags we have on the page (like PG_reserve) and call free_page.
> >
>
> Hmm, my MM knowledge is not good enough to judge if this works...
> I need to check more MM source code.
>
> Can any MM people help?
>

Hm, the memory-hotplug guy is here.

Can I ask a question?

- How is the crash kernel's memory reserved at boot?
Is it hidden from the system before mem_init()?

Thanks,
-Kame

Amerigo Wang

Aug 18, 2009, 2:40:09 AM

KAMEZAWA Hiroyuki wrote:
> On Mon, 17 Aug 2009 17:50:21 +0800
> Amerigo Wang <amw...@redhat.com> wrote:
>
>
>> Eric W. Biederman wrote:
>>
>>> Amerigo Wang <amw...@redhat.com> writes:
>>>
>>>
>>>
>>>> Not that simple, marking it as "__init" means it uses some "__init" data which
>>>> will be dropped after initialization.
>>>>
>>>>
>>> If we start with the assumption that we will be reserving too much and
>>> will free the memory once we know how much we really need I see a very
>>> simple way to go about this. We ensure that the reservation of crash
>>> kernel memory is done through a normal allocation so that we have
>>> struct page entries for every page. On 32bit x86 that is an extra 1MB
>>> for a 128MB allocation.
>>>
>>> Then when it comes time to release that memory we clear whatever magic
>>> flags we have on the page (like PG_reserve) and call free_page.
>>>
>>>
>> Hmm, my MM knowledge is not good enough to judge if this works...
>> I need to check more MM source code.
>>
>> Can any MM people help?
>>
>>
> Hm, memory-hotplug guy is here.
>

Hi, thank you!

> Can I have a question ?
>
> - How crash kernel's memory is preserved at boot ?
>

Use bootmem, I think.

> It's hidden from the system before mem_init() ?
>

Not sure, but probably yes. It is reserved in setup_arch() which is
before mm_init() which calls mem_init().

Do you have any advice to free that reserved memory after boot? :)

Thanks.

KAMEZAWA Hiroyuki

Aug 18, 2009, 4:30:21 AM

On Tue, 18 Aug 2009 14:31:23 +0800
Amerigo Wang <amw...@redhat.com> wrote:
> Hi, thank you!
> > Can I have a question ?
> >
> > - How crash kernel's memory is preserved at boot ?
> >
>
> Use bootmem, I think.
>

I see.

In x86,

setup_arch()
-> reserve_crashkernel()
-> find_and_reserve_crashkernel()
-> reserve_bootmem_generic()

By then, all "active ranges" are already registered and the memmap exists.

> > It's hidden from the system before mem_init() ?
> >
>
> Not sure, but probably yes. It is reserved in setup_arch() which is
> before mm_init() which calls mem_init().
>
> Do you have any advice to free that reserved memory after boot? :)
>

Let's see arch/x86/mm/init.c::free_initmem()

Maybe it's all you want.

- ClearPageReserved()
- init_page_count()
- free_page()
- totalram_pages++

But it takes no arguments. Maybe you need your own function or a modified version.
online_pages() does something very similar. But, hmm... writing an open-coded one
for crashkernel is not too bad, I think.

Thanks,
-Kame
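
The four steps listed above can be modelled in plain C to show the order of operations. This is strictly a user-space toy: struct toy_page, toy_free_reserved_range() and the counter are invented for illustration and only mimic what ClearPageReserved(), init_page_count(), free_page() and the totalram_pages update would do to real struct page entries.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
	bool reserved;     /* models PG_reserved */
	int count;         /* models the page reference count */
	bool on_free_list; /* models membership in the free page pool */
};

static unsigned long toy_totalram_pages;

/* Release pages [first, first + n) of 'map' back to the toy allocator. */
static void toy_free_reserved_range(struct toy_page *map, size_t first, size_t n)
{
	for (size_t i = first; i < first + n; i++) {
		map[i].reserved = false;    /* ClearPageReserved() */
		map[i].count = 1;           /* init_page_count() */
		map[i].count--;             /* free_page() drops that reference... */
		map[i].on_free_list = true; /* ...and the page joins the free pool */
		toy_totalram_pages++;       /* totalram_pages++ */
	}
}
```

Taking the range as arguments is the part free_initmem() lacks, which is why a separate helper or a modified version is suggested.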

Amerigo Wang

Aug 18, 2009, 5:00:21 AM

Nice help!

Yeah, I think we can make that be a generic wrapper function so that
both free_initmem() and shrink_crash_memory() can use it.

Then I will update and resend the whole patchset.

Thank you!

Amerigo Wang

Aug 18, 2009, 6:50:12 AM

KAMEZAWA Hiroyuki wrote:
> On Tue, 18 Aug 2009 14:31:23 +0800
> Amerigo Wang <amw...@redhat.com> wrote:
>
>>> It's hidden from the system before mem_init() ?
>>>
>>>
>> Not sure, but probably yes. It is reserved in setup_arch() which is
>> before mm_init() which calls mem_init().
>>
>> Do you have any advice to free that reserved memory after boot? :)
>>
>>
>
> Let's see arch/x86/mm/init.c::free_initmem()
>
> Maybe it's all you want.
>
> - ClearPageReserved()
> - init_page_count()
> - free_page()
> - totalram_pages++
>

Just FYI: calling ClearPageReserved() caused an oops: "Unable to handle
paging request".

I am trying to figure out why...

KAMEZAWA Hiroyuki

Aug 18, 2009, 8:10:05 PM

On Tue, 18 Aug 2009 18:35:32 +0800
Amerigo Wang <amw...@redhat.com> wrote:

> KAMEZAWA Hiroyuki wrote:
> > On Tue, 18 Aug 2009 14:31:23 +0800
> > Amerigo Wang <amw...@redhat.com> wrote:
> >
> >>> It's hidden from the system before mem_init() ?
> >>>
> >>>
> >> Not sure, but probably yes. It is reserved in setup_arch() which is
> >> before mm_init() which calls mem_init().
> >>
> >> Do you have any advice to free that reserved memory after boot? :)
> >>
> >>
> >
> > Let's see arch/x86/mm/init.c::free_initmem()
> >
> > Maybe it's all you want.
> >
> > - ClearPageReserved()
> > - init_page_count()
> > - free_page()
> > - totalram_pages++
> >
>
> Just FYI: calling ClearPageReserved() caused an oops: "Unable to handle
> paging request".
>
> I am trying to figure out why...
>

Hmm... then the memmap is not there.
A pfn_valid() check will help you. What arch? x86-64?

Thanks,
-Kame

Amerigo Wang

Aug 18, 2009, 10:50:08 PM

KAMEZAWA Hiroyuki wrote:
> On Tue, 18 Aug 2009 18:35:32 +0800
> Amerigo Wang <amw...@redhat.com> wrote:
>
>
>> KAMEZAWA Hiroyuki wrote:
>>
>>> On Tue, 18 Aug 2009 14:31:23 +0800
>>> Amerigo Wang <amw...@redhat.com> wrote:
>>>
>>>
>>>>> It's hidden from the system before mem_init() ?
>>>>>
>>>>>
>>>>>
>>>> Not sure, but probably yes. It is reserved in setup_arch() which is
>>>> before mm_init() which calls mem_init().
>>>>
>>>> Do you have any advice to free that reserved memory after boot? :)
>>>>
>>>>
>>>>
>>> Let's see arch/x86/mm/init.c::free_initmem()
>>>
>>> Maybe it's all you want.
>>>
>>> - ClearPageReserved()
>>> - init_page_count()
>>> - free_page()
>>> - totalram_pages++
>>>
>>>
>> Just FYI: calling ClearPageReserved() caused an oops: "Unable to handle
>> paging request".
>>
>> I am trying to figure out why...
>>
>>
> Hmm...then....memmap is not there.
> pfn_valid() check will help you. What arch ? x86-64 ?
>

Hmm, yes, x86_64, but this code is arch-independent; I mean it should
either work or not work on all arches, no?

So I am afraid we need to use another API to free it...

Thanks.

KAMEZAWA Hiroyuki

Aug 19, 2009, 4:20:09 AM

On Wed, 19 Aug 2009 10:41:13 +0800
Amerigo Wang <amw...@redhat.com> wrote:

The problem is whether the memmap is there or not. That's all.
Please look at the init sequence and check that the memmap is there.
If the memory for the crash kernel is obtained via bootmem,
aren't you trying to free a memory hole?

Thanks,
-Kame

Amerigo Wang

Aug 19, 2009, 6:50:13 AM

Yes, I am checking the code. Thanks!

Amerigo Wang

Aug 20, 2009, 5:20:09 AM

Hi,

It looks like mem_map has a 'struct page' for the reserved memory. I
checked my "early_node_map[] active PFN ranges" output, and the reserved
memory area for the crash kernel is right in one range. Am I missing
something here?

I don't know why that oops comes out; maybe because there are no PTEs for those
pages?

Thanks.

KAMEZAWA Hiroyuki

Aug 20, 2009, 8:40:08 PM

On Thu, 20 Aug 2009 17:15:56 +0800
Amerigo Wang <amw...@redhat.com> wrote:

> > The, problem is whether memmap is there or not. That's all.
> > plz see init sequence and check there are memmap.
> > If memory-for-crash is obtained via bootmem,
> > Don't you try to free memory hole ?
> >
>
> Hi,
>
> It looks like that mem_map has 'struct page' for the reserved memory, I
> checked my "early_node_map[] active PFN ranges" output, the reserved
> memory area for crash kernel is right in one range. Am I missing
> something here?
>
> I don't know why that oops comes out, maybe because of no PTE for those
> pages?
>

Hmm ? Could you show me the code you use ?

Thanks,
-Kame

Amerigo Wang

Aug 20, 2009, 10:00:13 PM

On Fri, Aug 21, 2009 at 09:34:52AM +0900, KAMEZAWA Hiroyuki wrote:
>On Thu, 20 Aug 2009 17:15:56 +0800
>Amerigo Wang <amw...@redhat.com> wrote:
>
>> > The, problem is whether memmap is there or not. That's all.
>> > plz see init sequence and check there are memmap.
>> > If memory-for-crash is obtained via bootmem,
>> > Don't you try to free memory hole ?
>> >
>>
>> Hi,
>>
>> It looks like that mem_map has 'struct page' for the reserved memory, I
>> checked my "early_node_map[] active PFN ranges" output, the reserved
>> memory area for crash kernel is right in one range. Am I missing
>> something here?
>>
>> I don't know why that oops comes out, maybe because of no PTE for those
>> pages?
>>
>Hmm ? Could you show me the code you use ?

(Sorry that I reply to you with my gmail, my work email can't send out
this message, probably because one of the destinations is broken...
I am the same person, don't be confused. :-)

Sure. Here it is:

+ for (addr = end + 1; addr < crashk_res.end; addr += PAGE_SIZE) {
+ printk(KERN_DEBUG "PFN is valid? %d\n", pfn_valid(addr>>PAGE_SHIFT));
+ ClearPageReserved(virt_to_page(addr));
+ init_page_count(virt_to_page(addr));
+ free_page(addr);
+ totalram_pages++;
+ }

pfn_valid() returns 1, and oops happens at ClearPageReserved().
('addr' is right between crashk_res.start and crashk_res.end)

Thank you!

KAMEZAWA Hiroyuki

Aug 20, 2009, 10:10:08 PM

Confused,
if pfn_valid(addr >> PAGE_SHIFT) == true

you should do
ClearPageReserved(pfn_to_page(addr >> PAGE_SHIFT));

because addr is physical address, not virtual.
I guess crashk_res.end is physical address....

Thanks,
-Kame
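
The distinction above can be shown with a small user-space model. It assumes a flat, direct-mapped layout in which a kernel virtual address is the physical address plus a fixed offset. TOY_PAGE_OFFSET, toy_pfn_from_phys() and toy_pfn_from_virt() are illustrative stand-ins, and the offset value is only an example, not a statement about any particular configuration.

```c
#include <stdint.h>

#define TOY_PAGE_SHIFT  12
#define TOY_PAGE_OFFSET 0xffff880000000000ULL /* example direct-map base */

/* What pfn_to_page(addr >> PAGE_SHIFT) indexes with: the frame number. */
static uint64_t toy_pfn_from_phys(uint64_t phys)
{
	return phys >> TOY_PAGE_SHIFT;
}

/* What a virt_to_page()-style helper does first: undo the direct-map offset. */
static uint64_t toy_pfn_from_virt(uint64_t virt)
{
	return (virt - TOY_PAGE_OFFSET) >> TOY_PAGE_SHIFT;
}
```

Feeding a physical address to the virtual-address helper wraps around and yields a frame number far outside any plausible memmap, which would be consistent with the paging fault reported at ClearPageReserved().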

Amerigo Wang

Aug 20, 2009, 10:50:10 PM

Excellent! You are right!

In fact, when I read the kexec code for the first time, I thought
'crashk_res' should hold physical addresses too, but after reading
more code I dropped that idea, so I was wrong. :-/

I will resend the whole patchset soon. It works now!

Thanks for your nice help, Hiroyuki!