[PATCH 0/29] overview

Eric W. Biederman

Jan 19, 2005, 4:10:11 AM

Andrew, the following patchset is against 2.6.11-rc1-mm1 with
all of the kexec patches removed. The list of removed patches
is included below.

This patchset is a major refresh of the kexec on panic
functionality in the kernel. The primary aim was to take
the requirements capture of the kernel crashdump patches and
start integrating the functionality cleanly into the kexec
patches.

Major accomplishments:
- Compat syscall support has been added.
- The crashdump capture code has been separated from the kexec on panic code.
- The kernel to jump to on panic is now loaded in place.
- A long standing bug that allowed two source pages to copy data
to a single destination page has been caught and fixed.
- Support for loading an x86_64 kernel in a reserved area of memory has been completed.

The crashdump code is currently slightly broken. I have attempted to
minimize the breakage so things can quickly be made to work again.

With respect to a final design discussion there are two remaining
open issues. The first is how little hardware shutdown we can get away
with in the kernel that is panicking. I believe we can reduce this
to a simple NMI to the other cpus telling them to stop. This has
been raised as a major concern in previous conversations.

The second issue is the most significant with respect to the
design of a kernel based crash dump capture implementation. How does
the crashdump capture process discover relevant information about the
kernel that just crashed? There are two options.

1) As represented by the current crashdump patches, the crashdump
kernel and the kernel that loads it are kept in sync so that
it has up-to-date versions of all of the crashed kernel's data structures
because it is built from the same source. So it only needs to
find the address of the data structures it would like to look at.

2) The relevant information, if it is available when sys_kexec_load
is called, is exported to user space, or the machine_crash_shutdown
method marshals what little information must be captured when the
machine dies in a well known standard format (most likely ELF
notes). This allows the crashdump capture process to simply pass
on the information or utilize it as appropriate.

If the second method can successfully represent all of the
interesting information, then we can allow kernel version
skew between the two kernels, and potentially implement
the entire crash dump capture process in user space.

As best as I have been able to discover, the interesting information
includes: the cpu state (registers) at the time of the crash/panic;
the list of memory regions the kernel that has crashed was using;
and potentially the list of pages dedicated to kernel data as opposed
to user space, so that people with insane amounts of memory (1TB+)
don't require unmanageably large core files.

Andrew, earlier, when asked about the possibility of merging the kexec
on panic code, you said:

> I don't want us to be in a position of merging all that code and then
> finding out that it cannot be made to work "sufficiently well",
> forcing us to revert it and find a new crashdump solution. You guys
> know far better than I when we will reach that threshold. If the
> kexec/dump developers can say "yup, this is going to work (because X)"
> then I'm happy.

So here is my subjective view.
- This code needs to sit in a development tree for a little while
to shake out whatever bugs still linger from my massive refactoring.
- Through the kexec patches the code and design appear to be sound.
Given that machine_kexec is little more than a jump, there are few
possible implementations that will not be able to use it. The only
exception I can see is running special dump drivers from the kernel
that crashed, and I believe no one thinks that will work well.
- Once we finish sorting out the best way to get information out of
the kernel that crashed I think we will have a complete architecture
that is largely portable to any architecture.

In the interests of full disclosure, my main interest is using the
kernel as a bootloader for other kernels, and that has been working
fairly well for years now :)

Eric

# Patches to remove from 2.6.11-rc1-mm1 before applying this patchset:
#
assign_irq_vector-section-fix.patch
kexec-i8259-shutdowni386.patch
kexec-i8259-shutdown-x86_64.patch
kexec-apic-virtwire-on-shutdowni386patch.patch
kexec-apic-virtwire-on-shutdownx86_64.patch
kexec-ioapic-virtwire-on-shutdowni386.patch
kexec-apic-virt-wire-fix.patch
kexec-ioapic-virtwire-on-shutdownx86_64.patch
kexec-e820-64bit.patch
kexec-kexec-generic.patch
kexec-ide-spindown-fix.patch
kexec-ifdef-cleanup.patch
kexec-machine_shutdownx86_64.patch
kexec-kexecx86_64.patch
kexec-kexecx86_64-4level-fix.patch
kexec-kexecx86_64-4level-fix-unfix.patch
kexec-machine_shutdowni386.patch
kexec-kexeci386.patch
kexec-use_mm.patch
kexec-loading-kernel-from-non-default-offset.patch
kexec-loading-kernel-from-non-default-offset-fix.patch
kexec-enabling-co-existence-of-normal-kexec-kernel-and-panic-kernel.patch
kexec-ppc-support.patch
#kexec-kexecppc.patch
#
crashdump-documentation.patch
crashdump-memory-preserving-reboot-using-kexec.patch
crashdump-memory-preserving-reboot-using-kexec-fix.patch
kdump-config_discontigmem-fix.patch
crashdump-routines-for-copying-dump-pages.patch
crashdump-routines-for-copying-dump-pages-kmap-fiddle.patch
crashdump-kmap-build-fix.patch
crashdump-register-snapshotting-before-kexec-boot.patch
crashdump-elf-format-dump-file-access.patch
crashdump-linear-raw-format-dump-file-access.patch
crashdump-minor-bug-fixes-to-kexec-crashdump-code.patch
crashdump-cleanups-to-the-kexec-based-crashdump-code.patch
#
x86-rename-apic_mode_exint.patch
x86-local-apic-fix.patch
#
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Vivek Goyal

Jan 21, 2005, 2:10:12 AM

Hi Andrew,

The following patch is against 2.6.11-rc1-mm2.

As mentioned in the following note from Eric, the crashdump code is currently
broken.

>
> The crashdump code is currently slightly broken. I have attempted to
> minimize the breakage so things can quickly be made to work again.

We have started making changes to get crashdump up and running again.
The following items have been identified so far.

1. Reserve the backup region (640k) during kernel bootup.
2. Copy the data to the backup region during a crash. (Moved to kexec user
space code; patch posted in a separate mail.)
3. Prepare ELF headers while loading the kexec panic kernel and store them in
the reserved memory area.
4. Pass the required information to the crashdump kernel, which parses it and
exports it through /proc/vmcore. (May be a user space utility; open to
discussion.)

The following patch implements item 1) in the list. We shall soon be rolling
out patches for the rest.

Thanks
Vivek

crashdump-x86-reserve-640k-memory.patch

Eric W. Biederman

Jan 21, 2005, 3:10:10 AM

Vivek Goyal <vgo...@in.ibm.com> writes:

> Hi Andrew,
>
> Following patch is against 2.6.11-rc1-mm2.
>
> As mentioned by following note from Eric, crashdump code is currently
> broken.
> >
> > The crashdump code is currently slightly broken. I have attempted to
> > minimize the breakage so things can quickly be made to work again.
>
> We have started doing changes to make crashdump up and running again.
> Following are few identified items to be done.
>
> 1. Reserve the backup region (640k) during kernel bootup.

Why do we need a separate region for this?

It should be simple enough to take 640k out of the area kexec reserves
for the crash dump kernel. That is what the previous code implemented.

> 2. Copy the data to backup region during crash.(moved to kexec user
> space code, patch posted in separate mail)

Thanks. By and large it looks sane. It won't work yet, but it is
moving in the right direction.

> +++ linux-2.6.11-rc1-mm2-kexec-eric-root/include/linux/kexec.h 2005-01-20 13:55:33.000000000 +0530
> @@ -79,7 +79,7 @@ struct kimage {
> unsigned long control_page;
>
> /* Flags to indicate special processing */
> - int type : 1;
> + unsigned int type : 1;

That looks like a sane bug fix. Having values of 0 and -1 is not quite what
I was expecting...

Eric

Vivek Goyal

Jan 21, 2005, 5:20:25 AM

On Fri, 2005-01-21 at 13:24, Eric W. Biederman wrote:
> Vivek Goyal <vgo...@in.ibm.com> writes:
>
> > Hi Andrew,
> >
> > Following patch is against 2.6.11-rc1-mm2.
> >
> > As mentioned by following note from Eric, crashdump code is currently
> > broken.
> > >
> > > The crashdump code is currently slightly broken. I have attempted to
> > > minimize the breakage so things can quickly be made to work again.
> >
> > We have started doing changes to make crashdump up and running again.
> > Following are few identified items to be done.
> >
> > 1. Reserve the backup region (640k) during kernel bootup.
>
> Why do we need a separate region for this?
>
> It should be simple enough to take 640 out of the area kexec reserves
> for the crash dump kernel. That is what the previous code implemented.

The previous code also reserved the backup memory region after the crash kernel
region. It is just a matter of interpretation. My understanding is that the
crash kernel reserved region represents somewhere one can load the
panic kernel directly, and which the new kernel can use for
memory allocation.

I don't want to steal the backup region from the crash kernel region;
otherwise I shall have to boot the crash kernel with some strange
values like memmap=(32M-640k)@16M (symbolically) to prevent the crash kernel
from overwriting the backup region. Why make the user aware of the location
of the backup region?

Alternatively, this can be managed by reserving this backup region again
in the crash kernel to avoid any stomping. Maybe pass the backup region
location to the new kernel through a parameter segment or through the command
line, but I don't see a strong reason for doing that.

Thanks
Vivek

Eric W. Biederman

Jan 21, 2005, 6:20:09 AM

On deeper review, your patch as it stands is incomplete. In particular
you don't provide a way to either hardcode or dynamically set
the area you are attempting to reserve to hold the backup region.

Vivek Goyal <vgo...@in.ibm.com> writes:

> On Fri, 2005-01-21 at 13:24, Eric W. Biederman wrote:
> > Why do we need a separate region for this?
> >
> > It should be simple enough to take 640 out of the area kexec reserves
> > for the crash dump kernel. That is what the previous code implemented.
>
> Previous code also reserved the backup memory region after crash kernel
> region. It is just a matter of interpretation. What I understand that
> crash kernel reserved region represents something where one can load the
> panic kernel directly and new kernel can use this memory region for
> memory allocation.

Yes the reservation is a hunk of memory reserved for use by the crashdump
process, or whatever happens after panic. It is up to the loaded code
to define how that memory is used. purgatory.ro is a legitimate part
of that loaded code.

> I don't want to steal the backup region from crash kernel region
> otherwise, I shall have to boot the crash kernel with some strange
> values like memmap=(32M-640k)@16M (symbolically) to prevent crash kernel
> overwriting backup region. Why to make user aware of location of backup
> region.

Making the user aware of the region makes it one more thing for the user
to be aware of and to manually manage. Based on what was passed as
crashkernel=... we should be able to automate all of the rest of it,
so generating a weird memmap= line should not be hard.

I will have to wait and see but it would not surprise me if we settled
on a fixed address per architecture for the reservation to make it
easier for various users.

On that note, we probably want to move the magic that we are doing
for crashdumps into the linux loader (i.e. x86-linux-setup.c) in
kexec-tools, as most of these pieces are specific to taking a
crashdump with linux. Not that I expect we will be doing it with
anything else, but...

> Alternatively, this can be managed by reserving this backup region again
> in crash kernel to avoid any stomping. May be pass backup region
> location to new kernel through parameter segment or through command line
> but don't see a strong reason for doing that.

Probably the biggest reason for doing it in one reservation is that
it happens to be an implementation detail of the crashdump capture
kernel. If that kernel is not SMP I believe you can safely leave the
first 640k alone. I know at least one other effort has had success in
that area.

In general it is not good to make unnecessary implementation details
between two pieces of software be part of their interface.

Eric

Vivek Goyal

Jan 23, 2005, 4:30:13 AM

On Fri, 2005-01-21 at 16:43, Eric W. Biederman wrote:
> On deeper review your patch as it stands is incomplete. In particular
> you don't provide a way to either hardcode or dynamically set
> the area you are attempting to reserve to hold the backup region.

Well. Here is the new patch. This one steals the 640k from the top of the memory
region reserved for the crash kernel.

A new command line parameter (crashbackup=) has been introduced for
crash dump kernels. This parameter specifies the location of the backup
region from which to retrieve the backup data.

Thanks
Vivek

crashdump-x86-reserve-640k-memory.patch

Andrew Morton

Jan 26, 2005, 9:20:09 PM

ebie...@xmission.com (Eric W. Biederman) wrote:
>
> There is evil intermingling and false dependency sharing between
> the dying kernel and the crash capture kernel in this patch,

yikes! I'll drop it from -mm while we have a rethink.

Eric W. Biederman

Jan 26, 2005, 10:20:09 PM

Right now I am very frustrated with reviewing any of the crashdump
patches. When I make comments usually things change just enough that
what I said is addressed but things are addressed very much at
a surface level. Which means that if I think any kind of substantial
change is needed the only way I seem to be able to communicate
that is by actually implementing it myself.

Code that works today is great; it does manage the job of requirements
capture. But just throwing code together when you are dealing
with fundamental interface boundaries is not a good way to build
a sustainable design. And with the crashdump code I want an
interface that is at least as simple and as stable as the syscall
interface.

At the very least if a patch is just a snapshot of your development
process up for comment and you are going to continue on making
headway please say as much. If I know the code is quite possibly
going to change in some pretty fundamental ways I can stop worrying
about it. This patch is certainly nothing I would want for more
than a couple-of-days hack in my personal development tree.

I will try once again...

There is evil intermingling and false dependency sharing between
the dying kernel and the crash capture kernel in this patch, and
virtually all of the code is unnecessary. I have already addressed
why.

Vivek Goyal <vgo...@in.ibm.com> writes:

What is wrong with user space doing all of the extra space
reservation?

Could you send this fairly obvious kexec fix, as a separate patch?

> diff -puN include/linux/kexec.h~crashdump-x86-reserve-640k-memory include/linux/kexec.h
> --- linux-2.6.11-rc1/include/linux/kexec.h~crashdump-x86-reserve-640k-memory 2005-01-22 14:16:27.000000000 +0530
> +++ linux-2.6.11-rc1-root/include/linux/kexec.h 2005-01-22 14:16:27.000000000 +0530
> @@ -79,7 +79,7 @@ struct kimage {
> unsigned long control_page;
>
> /* Flags to indicate special processing */
> - int type : 1;
> + unsigned int type : 1;
> #define KEXEC_TYPE_DEFAULT 0
> #define KEXEC_TYPE_CRASH 1
> };

Vivek Goyal

Jan 27, 2005, 8:00:16 AM

Hi Eric,

It looks like we are looking at things a little differently. I
see a portion of the picture in your mind, but obviously not
entirely.

Perhaps, we need to step back and iron out in specific terms what
the interface between the two kernels should be in the crash dump
case, and the distribution of responsibility between kernel, user space
and the user.

[BTW, the patch was intended as a step in development up for
comment early enough to be able to get agreement on the interface
and think issues through to more completeness before going
too far. Sorry, if that wasn't apparent.]

When you say "evil intermingling", I'm guessing you mean the
"crashbackup=" boot parameter ? If so, then yes, I agree it'd
be nice to find a way around it that doesn't push hardcoding
elsewhere.

Let me explain the interface/approach I was looking at.

1. First kernel reserves some area of memory for the crash/capture kernel as
specified by the crashkernel=X@Y boot time parameter.

2. First kernel marks the top 640K of this area as the backup area (if the
architecture needs it). This is sort of a hardcoding, and probably this
space reservation can be managed from user space as well, as mentioned by
you in this mail below.

3. Location of the backup region is exported through /proc/iomem, which can
be read by a user space utility to pass this information to the purgatory code
to determine where to copy the first 640K.

Note that we do not make any additional reservation for the
backup region. We carve this out from the top of the already
reserved region and export it through /proc/iomem so that
the user space code and the capture kernel code need not
make any assumptions about where this region is located.

4. Once the capture kernel boots, it needs to know the location of
backup region for two purposes.

a. It should not overwrite the backup region.

b. There needs to be a way for the capture tool to access the original
contents of the backed up region.

Boot time parameter crashbackup=A@B has been provided to pass this
information to capture kernel. This parameter is valid only for capture
kernel and becomes effective only if CONFIG_CRASH_DUMP is enabled.

> What is wrong with user space doing all of the extra space
> reservation?

Just for clarity, are you suggesting that kexec-tools create an additional
segment for the backup region and pass the information to the kernel?

There is no problem in doing the reservation from user space except
one. How do the user, and in turn the capture kernel, come to know the
location of the backup region, assuming that the user is going to provide
the exactmap for the capture kernel to boot into?

Just a thought: is it a good idea for kexec-tools to create and
pass memmap parameters, making the appropriate adjustment for the backup
region?

I had another question. How is the starting location of elf headers
communicated to capture tool? Is parameter segment a good idea? or
some hardcoding?

Another approach can be that backup area information is encoded in the ELF
headers and the capture kernel is booted with a modified memmap (the user gets
backup region information from /proc/iomem), and the capture tool can
extract backup area information from the ELF headers as stored by the first
kernel.

Could you please elaborate a little more on what aspect of your view
differs from the above?

Thanks
Vivek

Eric W. Biederman

Jan 27, 2005, 4:00:18 PM

For the guys on ppc, and other architectures that have all of their
cpu memory behind an iommu, I propose we create a /proc/cpumem
which is the subset of /proc/iomem that deals with RAM. In any event,
as something like that is straightforward to implement, I will
assume the existence of the functionality and we can attack the
details when we merge the first of those architectures
into the kernel.

Vivek Goyal <vgo...@in.ibm.com> writes:

> Hi Eric,
>
> It looks like we are looking at things a little differently. I
> see a portion of the picture in your mind, but obviously not
> entirely.
>
> Perhaps, we need to step back and iron out in specific terms what
> the interface between the two kernels should be in the crash dump
> case, and the distribution of responsibility between kernel, user space
> and the user.
>
> [BTW, the patch was intended as a step in development up for
> comment early enough to be able to get agreement on the interface
> and think issues through to more completeness before going
> too far. Sorry, if that wasn't apparent.]

It wasn't quite, and the fact that Andrew picked it up added
to the confusion.

> When you say "evil intermingling", I'm guessing you mean the
> "crashbackup=" boot parameter ? If so, then yes, I agree it'd
> be nice to find a way around it that doesn't push hardcoding
> elsewhere.

I believe there are some alternatives to crashbackup= in the
crashdump capture kernel. But as long as that code is running
in the kernel we can't do a lot better.

However for the primary kernel it has no need to know that we
even have a backup region, nor does it need to know about the
size of the backup region. That can all be handled with the single
reservation, we have now.

/sbin/kexec which makes the backup needs to know about it and it needs
to pass that information on. But the primary kernel does not.

The largest reason I am sensitive to this issue is that if you are not
booting an SMP kernel I don't believe we need a backup region on x86
at all. If we can remove that dependency I want the freedom to do
that without having to modify the primary kernel. Or if we discover
we need to preserve other things like the ACPI, mp and pirq tables
I don't want to require patching the kernel just so I can copy those
and preserve them.

> Let me explain the interface/approach I was looking at.
>
> 1.First kernel reserves some area of memory for crash/capture kernel as
> specified by crashkernel=X@Y boot time parameter.
>
> 2.First kernel marks the top 640K of this area as backup area. (If
> architecture needs it.) This is sort of a hardcoding and probably this
> space reservation can be managed from user space as well as mentioned by
> you in this mail below.
>
> 3. Location of backup region is exported through /proc/iomem which can
> be read by user space utility to pass this information to purgatory code
> to determine where to copy the first 640K.

And 1-3 can be done in /sbin/kexec. And if it is done there we can
increase our freedom of implementation in the crashdump capture process
quite a bit.

> Note that we do not make any additional reservation for the
> backup region. We carve this out from the top of the already
> reserved region and export it through /proc/iomem so that
> the user space code and the capture kernel code need not
> make any assumptions about where this region is located.
>
> 4. Once the capture kernel boots, it needs to know the location of
> backup region for two purposes.
>
> a. It should not overwrite the backup region.
>
> b. There needs to be a way for the capture tool to access the original
> contents of the backed up region
>
> Boot time parameter crashbackup=A@B has been provided to pass this
> information to capture kernel. This parameter is valid only for capture
> kernel and becomes effective only if CONFIG_CRASH_DUMP is enabled.

But that is not what you implemented. crashbackup= was an alternative
to the carving out of 640K in parts 1-3.

> > What is wrong with user space doing all of the extra space
> > reservation?
>
> Just for clarity, are you suggesting kexec-tools creating an additional
> segment for the backup region and pass the information to kernel.

Yes, having kexec create a bss segment for the backup region would
be a good idea. It will keep us from stomping on the kernel trampoline
(think the identity mapped x86_64 page tables here) by accident.

> There is no problem in doing reservation from user space except
> one. How does the user and in-turn capture kernel come to know the
> location of backup region, assuming that the user is going to provide
> the exactmap for capture kernel to boot into.
>
> Just a thought, is it a good idea for kexec-tools to be creating and
> passing memmap parameters doing appropriate adjustment for backup
> region.

Exactly. Having /sbin/kexec do this instead of the user doing this
manually is a much simpler solution than we have now.

> I had another question. How is the starting location of elf headers
> communicated to capture tool? Is parameter segment a good idea? or
> some hardcoding?

I recognize the need for that information. But I do not recognize
the need for it to be an ELF header (we do need something
conceptually close). If we don't have regions of the memory map
appearing and disappearing dynamically we can get this information
from /proc/iomem, before the crash and store it in one of the data
segments that we checksum.

> Another approach can be that backup area information is encoded in elf
> headers and capture kernel is booted with modified memmap (User gets
> backup region information from /proc/iomem) and capture tool can
> extract backup area information from elf headers as stored by first
> kernel.
>
> Could you please elaborate a little more on what aspect of your view
> differs from the above.

See above.

The direction I would take if I were to implement
the crashdump functionality is something different.

Instead of patching crashdump functionality into the kernel,
I would create a subdirectory in kexec-tools called crashdump
and put in the source for a user-space program that could run as
init. In addition I would put in the code to generate an
initramfs cpio.gz archive of that program. And I would build
the program against uclibc, klibc or one of the other libc variants
that actually allows building static binaries. Unless something
has changed recently glibc does not allow for truly static binaries
as it dynamically opens /lib/libnss*

Given the pain of building against an external library that is not
widely distributed I would probably take a snapshot of the code and
place it in crashdump/libc in the kexec-tools source. Taking a
snapshot of frequently used libraries is commonly done with the
gnu toolchain and is wonderfully effective in resolving painful
dependencies.

The crashdump /init would just mmap /dev/mem to read the raw memory.
From there it would generate the core file.

When kexec'ing a panic kernel I would simply have /sbin/kexec
unconditionally load that cpio.gz as the initrd and things
would work.

The large advantage of doing all of this in user space
is that it moves all of the crashdump policy into user space
and into one source tree, for simplified maintenance.

However as long as we gracefully handle the interface
between the primary kernel and the capture kernel we can
switch mechanisms for actually taking the crash dump,
kernel based or user space as seems most sane.

Eric

Vivek Goyal

Jan 28, 2005, 7:20:11 AM

Hi Eric,

> However for the primary kernel it has no need to know that we
> even have a backup region, nor does it need to know about the
> size of the backup region. That can all be handled with the single
> reservation, we have now.
>
> /sbin/kexec which makes the backup needs to know about it and it needs
> to pass that information on. But the primary kernel does not.

Agreed. The primary kernel need not be aware of the backup region, and
reservation of this region can be better managed from user space.

> > Boot time parameter crashbackup=A@B has been provided to pass this
> > information to capture kernel. This parameter is valid only for capture
> > kernel and becomes effective only if CONFIG_CRASH_DUMP is enabled.
>
> But that is not what you implemented. crashbackup= was an alternative
> to the carving out of 640K in parts 1-3.

Not really. crashbackup= is not being used for carving out backup
region. It is just used for passing the address of this region to second
kernel. That's why it has been put under CONFIG_CRASH_DUMP.

This looks good. So memory regions are parsed from /proc/iomem and this
information is put in one data segment and stored in reserved region
during panic kernel load time.

But I am unable to work out how the capture tool (even if it's all
in user space) gets to know the address of this segment (or for that
matter, the bss segment created for the backup region). Am I missing
something obvious?

This seems to be a good direction.

Thanks
Vivek

Eric W. Biederman

Jan 28, 2005, 3:40:13 PM

Vivek Goyal <vgo...@in.ibm.com> writes:

> Hi Eric,
>
>
> > However for the primary kernel it has no need to know that we
> > even have a backup region, nor does it need to know about the
> > size of the backup region. That can all be handled with the single
> > reservation, we have now.
> >
> > /sbin/kexec which makes the backup needs to know about it and it needs
> > to pass that information on. But the primary kernel does not.
>
>
> Agreed. Primary kernel need not to be aware of backup region and
> reservation of this region can be better managed from user space.

Good. It sounds like we are pretty much back on the same page then.

> > > Boot time parameter crashbackup=A@B has been provided to pass this
> > > information to capture kernel. This parameter is valid only for capture
> > > kernel and becomes effective only if CONFIG_CRASH_DUMP is enabled.
> >
> > But that is not what you implemented. crashbackup= was an alternative
> > to the carving out of 640K in parts 1-3.
>
>
> Not really. crashbackup= is not being used for carving out backup
> region. It is just used for passing the address of this region to second
> kernel. That's why it has been put under CONFIG_CRASH_DUMP.

Ok I missed a piece in your patch. You have crashdumpk_res, and
crashbackup_start, crashbackup_end. And I missed the fact that
they were different variables as they dealt with the same concept.

So that patch actually should have been three patches. The
one line bug fix. The crashdumpk_res bit (which I strongly
object to) and the crashbackup_start/_end bit. The fact
that all three were in the same patch is a reviewing and maintenance
pain.

Please in the future do not include code that runs in the primary
kernel and crashdump specific code that runs in the capture kernel in
the same patch.

> This looks good. So memory regions are parsed from /proc/iomem and this
> information is put in one data segment and stored in reserved region
> during panic kernel load time.
>
> But I am unable to co-relate as to how the capture tool (even if its all
> in user space) gets to know the address of this segment (or for that
> matter, the bss segment created for backup region). Am I missing
> something obvious.

There are a lot of choices at that point:
- Place the data on the kernel command line, and pick
it up from /proc/cmdline.
- Place the data in a file on the initramfs.
- Place the data in a user space data segment.

> > However as long as we gracefully handle the interface
> > between the primary kernel and the capture kernel we can
> > switch mechanisms for actually taking the crash dump,
> > kernel based or user space as seems most sane.
>
>
> This seems to be a good direction.

Cool.

One of the ideas worth exploring is to see about stabilizing the
other side of this interface as well. That is we should explore
providing a fixed interface coming out of purgatory.ro to the new
kernel and its user space (i.e. the ELF header like thing). I think
we are quite close to that point already. And this goes back to your
question of how do we let the capture kernel/user space know where to
look.

Eric

Koichi Suzuki

Feb 1, 2005, 3:20:08 AM

A hook in the panic code is a very good idea and is useful in various scenarios.
It could be used to kick off RAM dump code, obviously, and also to kick off
code that initiates failover, etc. Various uses are possible, so I
believe that this hook should be prepared for wider use.

--
Koichi Suzuki
NTT DATA Intellilink Corp.

Eric W. Biederman

Feb 1, 2005, 4:20:05 AM

Koichi Suzuki <koi...@intellilink.co.jp> writes:

> Hook in panic code is very good idea and is useful in various scenes. It could
> be used to kick RAM dump code, obviously, and also kick the code to initiate
> failover, etc. Various use could be possible so I believe that this hook
> should be prepared for wider use.

It is. Basically it is the normal kexec interface that allows you to
boot another kernel. With a few restrictions that should keep it as
reliable as possible when the kernel has not shut itself down cleanly.

The hardest case is to do a useful system core dump, as that requires
looking at what has gone before. For the rest, if you can do it
with a kernel and an initramfs you are in good shape.

There seems to be a significant amount of interest in the full
system core dump case so that is what the work is concentrating
on.

Vivek Goyal

Feb 1, 2005, 9:30:20 AM

Well, I am trying to put the already discussed ideas together. I was
planning to work on the following design. Please comment.

Crashed Kernel <--> Capture Kernel (or User Space) Interface:
-------------------------------------------------------------

The whole idea is that the crash image is represented in ELF core format.
These ELF headers are prepared by kexec-tools in user space and put in one
segment. The address of the start of the image is passed to the capture
kernel (or user space) using a command line parameter (e.g. crashimage=).
Now either kernel space or user space can parse the ELF headers, extract
the required information and export the final kernel ELF core image.

> ebie...@xmission.com wrote:
> If we were using an ELF header I would include one PT_NOTE program
> header per cpu (Giving each cpu its own area to mess around in).
> And I would use one PT_LOAD segment per possible memory zone.
> So in the worst case (current sgi altix) (MAX_NUMNODES=256,
> MAX_NR_ZONES=3, MAX_NR_CPUS=1024) 256*3+1024 = 1792 program
> headers. At 56 bytes per 64bit program header that is 100352 bytes
> or 98KiB. A little bit expensive. A tuned data structure with
> 64bit base and size would only consume 1792*16 = 28672 or 28KiB.

If I prepare one elf header for each physically contiguous memory area (as
obtained from /proc/iomem) instead of per zone, then the number of elf
headers will come down significantly. I don't have any idea of the number
of physically contiguous regions actually present per machine, but roughly
assuming it to be 1 per node, it will lead to 256 + 1024 = 1280 program
headers. At 56 bytes per 64-bit program header this will amount to 70KB.

This is a worst-case estimate, and on lower-end machines this will require
much less space. On machines as big as 1024 cpus, this should not be a
concern, as big machines come with big RAMs.
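
To make the arithmetic concrete, here is a small sketch in Python (used only for illustration, the tools under discussion are written in C). It derives the 56-byte size from the standard Elf64_Phdr field layout and reproduces both estimates quoted in the thread. The helper name phdr_table_bytes is made up for this example.

```python
import struct

# Elf64_Phdr layout: p_type, p_flags (32-bit each), then p_offset, p_vaddr,
# p_paddr, p_filesz, p_memsz, p_align (64-bit each).
ELF64_PHDR = struct.Struct("<IIQQQQQQ")


def phdr_table_bytes(num_load, num_note):
    """Size in bytes of a table holding the given number of program headers."""
    return (num_load + num_note) * ELF64_PHDR.size


# Worst-case figures quoted in the thread (SGI Altix).
MAX_NUMNODES, MAX_NR_ZONES, MAX_NR_CPUS = 256, 3, 1024

# One PT_LOAD per possible zone, one PT_NOTE per cpu.
per_zone = phdr_table_bytes(MAX_NUMNODES * MAX_NR_ZONES, MAX_NR_CPUS)
# One PT_LOAD per contiguous region, assumed to be one region per node.
per_node = phdr_table_bytes(MAX_NUMNODES, MAX_NR_CPUS)

print(per_zone, per_zone // 1024)
print(per_node, per_node // 1024)
```

Under these assumptions the per-zone table comes to 100352 bytes (98 KiB) and the per-region table to 71680 bytes (70 KiB), matching the numbers above.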

Eric, do you still think that ELF headers are inappropriate to be passed
across the interface boundary?

ELF headers can be prepared by kexec-tools in advance and put into one
of the data segments. This requires the following information to be
available to user space.

- Starting address of the space reserved by the kernel for the notes
section (crash_notes[]). Probably can be obtained from /proc/kallsyms?

- NR_CPUS. Maybe sysconf(_SC_NPROCESSORS_CONF) should be sufficient.

- Size of memory reserved per cpu. No clue how to get that. Any
suggestions?
Maybe hard-coding something like a 1K area per cpu should be enough to
address future needs?
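
As a rough sketch of how user space might collect the first two items, the Python below looks a symbol up in /proc/kallsyms-style text and asks sysconf for the configured processor count. The function names and the sample addresses are invented for illustration, and whether crash_notes is actually visible in /proc/kallsyms on a given kernel is an assumption.

```python
import os


def find_symbol(kallsyms_text, name):
    """Return the address of `name` from /proc/kallsyms-style text, or None."""
    for line in kallsyms_text.splitlines():
        fields = line.split()
        # Each line is "<hex address> <type> <symbol> [module]".
        if len(fields) >= 3 and fields[2] == name:
            return int(fields[0], 16)
    return None


def configured_cpus():
    """Configured processor count, as sysconf(_SC_NPROCESSORS_CONF) reports it."""
    return os.sysconf("SC_NPROCESSORS_CONF")


# Made-up sample; real input would be the contents of /proc/kallsyms.
SAMPLE = """\
c0100000 T _text
c0456780 B crash_notes
c0460000 b some_static  [some_module]
"""

print(hex(find_symbol(SAMPLE, "crash_notes")))
```

Note that sysconf reports the processors configured on the running system, which may be smaller than the compile-time NR_CPUS.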

Regarding Backup Region
-----------------------

- Kexec user space does the reservation for backup region segment.
- Purgatory copies the backup data to backup region. (Already
implemented)
- A separate elf header is prepared to represent the backed-up memory
region, and the "offset" field of this program header can contain the
actual physical address where the backup contents are stored.

Thanks
Vivek

Eric W. Biederman

Feb 1, 2005, 10:40:11 AM

Vivek Goyal <vgo...@in.ibm.com> writes:

> Well, trying to put the already discussed ideas together. I was
> planning to work on following design. Please comment.
>
> Crashed Kernel <-->Capture Kernel(or User Space) Interface:
> ----------------------------------------------------------
>
> The whole idea is that Crash image is represented in ELF Core format.
> These ELF Headers are prepared by kexec-tools user space and put in one
> segment. Address of start of image is passed to the capture kernel(or
> user space) using one command line (eg. crashimage=). Now either kernel
> space or user space can parse the elf headers and extract required
> information and export final kernel elf core image.

Sounds sane. We need to make certain there is a checksum of that
region but putting it in a separate segment should ensure that.

I also think we need to look at the name crashimage= and see if we
can find something more descriptive. But that is minor. Possibly
elfcorehdr=. We have a little while to think about that one before we
are stuck.

> > ebie...@xmission.com wrote:
> > If we were using an ELF header I would include one PT_NOTE program
> > header per cpu (Giving each cpu it's own area to mess around in).
> > And I would use one PT_LOAD segment per possible memory zone.
> > So in the worst case (current sgi altix) (MAX_NUMNODES=256,
> > MAX_NR_ZONES=3, MAX_NR_CPUS=1024) 256*3+1024 = 1792 program
> > headers. At 56 bytes per 64bit program header that is 100352 bytes
> > or 98KiB. A little bit expensive. A tuned data structure with
> > 64bit base and size would only consume 1792*16 = 28672 or 28KiB.
>
> If I prepare One elf header for each physical contiguous memory area (as
> obtained from /proc/iomem) instead of per zone, then number of elf
> headers will come down significantly.

A clarification on terminology: we are talking about struct Elf64_Phdr
here. There is only one ELF header. That seems to be clear farther
down.

> I don't have any idea on number of
> actual physically contiguous regions present per machine, but roughly
> assuming it to be 1 per node, it will lead to 256 + 1024 = 1280 program
> headers.At 56 bytes per 64 bit program header this will amount to 70KB.
>
> This is worst case estimate and on lower end machines this will require
> much less a space. On machines as big as 1024 cpus, this should not be a
> concern, as big machines come with big RAMs.

Agreed. Size is not the primary issue. There is some clear waste
but that is a secondary concern. Not performing a 1-1 mapping
to the kernel data structures also seems to be a win, as the concepts
are noticeably different.

> Eric, do you still think that ELF headers are inappropriate to be passed
> across interface boundary.

I have serious concerns about the kernel generating the ELF headers
and only delivering them after the kernel has crashed. Because
then we run into questions of what information can be trusted. If we
avoid that issue I am not too concerned.

> ELF headers can be prepared by kexec-tools in advance and put into one
> of the data segments. This requires following information to be
> available to user space.

For a first round doing it in user space sounds sane. Obtaining
the information at the time of load is much more robust.

> - Starting address of space reserved by kernel for notes section
> (crash_notes[]). Probably can be obtained from /proc/kallsyms?

At least for a start.

> - NR_CPUS. May be sysconf(_SC_NPROCESSORS_CONF) should be
> sufficient.

Either that or /proc/cpuinfo. But the sysconf approach looks more
robust at this point.

> - Size of memory reserved per cpu. No clue how to get that? Any
> suggestions?
> May be hard-coding like 1K area per cpu should be to address the
> future needs ?

The nice thing about doing this in user space is that we can hack
something together and get each side of the interface sorted
out independently. i.e. We can hard code it for now. Sort out
the users and then come back and make certain we have the information
exported cleanly. 1K per cpu currently matches the kernel code so
it is a good place to start :)

It does look like getting the size of each array element is a problem,
so the current kernel code certainly needs to be revisited. And
there are quite a few other pieces of how we are obtaining
the information that can be fixed as well.

> Regarding Backup Region
> -----------------------
>
> - Kexec user space does the reservation for backup region segment.
> - Purgatory copies the backup data to backup region. (Already
> implemented)
> - A separate elf header is prepared to represent backed up memory
> region. And "offset" field of this program header can contain the actual
> physical address where backup contents are stored.

I like that. I was thinking a virtual versus physical address
separation. But using the offset field is much more appropriate,
and it leaves us the potential of doing something nice like specifying
the kernel's virtual address later on. Looking exclusively at the
offset field to know which memory addresses to dump sounds good.
For now we should have virtual==physical==offset except for the
backup region.
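
A minimal sketch of that convention in Python, assuming the standard Elf64_Phdr layout: p_offset says where the bytes can be read from now, while p_vaddr and p_paddr keep the original address. The helper load_phdr and the addresses used are hypothetical.

```python
import struct

PT_LOAD = 1
# p_type, p_flags, p_offset, p_vaddr, p_paddr, p_filesz, p_memsz, p_align
ELF64_PHDR = struct.Struct("<IIQQQQQQ")


def load_phdr(paddr, size, stored_at=None):
    """Pack a PT_LOAD header describing the memory range [paddr, paddr + size).

    p_offset equals the physical address unless the contents were copied
    elsewhere, which is the case for the backup region.
    """
    offset = paddr if stored_at is None else stored_at
    return ELF64_PHDR.pack(PT_LOAD, 0, offset, paddr, paddr, size, size, 0)


# Hypothetical layout: the low 640K was copied to 0x1000000 by purgatory.
backup = load_phdr(0x0, 640 * 1024, stored_at=0x1000000)
# An ordinary region: virtual == physical == offset.
normal = load_phdr(0x100000, 0x3F00000)
```

A dump tool following this rule only needs p_offset and p_filesz to know what to read, and can still report the original address from p_paddr.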

This sounds like a good place to start.

Eric

Koichi Suzuki

Feb 2, 2005, 2:20:09 AM

ebie...@xmission.com wrote:
> Koichi Suzuki <koi...@intellilink.co.jp> writes:
>
>
>>Hook in panic code is very good idea and is useful in various scenes. It could
>>be used to kick RAM dump code, obviously, and also kick the code to initiate
>>failover, etc. Various use could be possible so I believe that this hook
>>should be prepared for wider use.
>
>
> It is. Basically it is the normal kexec interface that allows you to
> boot another kernel. With a few restrictions that should keep it as
> reliable as possible when the kernel has not shut itself down cleanly.
>
> The hardest case is to do a useful system core dump. As that requires
> looking at what has gone before. For the rest if you can do it
> with a kernel and a initramfs you are in good shape.
>
> There seems to be a significant amount of interest in the full
> system core dump case so that is what the work is concentrating
> on.
>
> Eric
>

I meant that with kexec and the dump hook, many more things could be
done in addition to a full core dump. Initiating failover to another node
would be one example. Starting with this hook, there must be many good
ideas. So my idea is to make this hook general purpose, not specific to
a core dump tool.

Koichi Suzuki

Itsuro Oda

Feb 2, 2005, 2:20:09 AM

Hi,

I can't understand why the ELF format is necessary.

I think the only necessary information is "what physical address
regions are valid to read". This information is necessary for any
sort of dump tool (and it must be obtained while the system is healthy).
Eric's /proc/cpumem idea sounds nice to me.

--
Itsuro ODA <o...@valinux.co.jp>

Itsuro Oda

Feb 2, 2005, 2:50:10 AM

Hi,

I don't like calling crash_kexec() directly in (e.g.) panic().
It should be call_dump_hook() (or something like this).

I think the only necessary modifications of the kernel are:
- insert the hooks that call a dump function when a crash occurs
- a binding interface that binds a dump function to the hook
(like register_dump_hook())
- supply the information of valid physical address regions
(- maybe some existing functions and variables need to be exported?)

I think this allows any sort of dump function to be implemented
as a kernel module. I don't think building the "kexec based
crashdump" into the kernel is the best way.

Thanks.

On 01 Feb 2005 02:06:42 -0700

ebie...@xmission.com (Eric W. Biederman) wrote:

--
Itsuro ODA <o...@valinux.co.jp>

Koichi Suzuki

Feb 2, 2005, 3:00:11 AM

Itsuro Oda wrote:
> Hi,
>
> I can't understand why ELF format is necessary.
>
> I think the only necessary information is "what physical address
> regions are valid to read". This information is necessary for any
> sort of dump tools. (and must get it while the system is normal.)
> The Eric's /proc/cpumem idea sounds nice to me.
>

I agree. Format conversion should be done separately, on a healthy
system, and we should restrict what is done while taking the dump to as
little as possible. Conversion from a plain memory image to crash/lcrash
format will be very useful for reusing existing tools and experience. I
already have such a tool and (if my administration allows) I can make it
open. Let me do some paperwork.

Koichi Suzuki
NTT DATA Intellilink

Vivek Goyal

Feb 2, 2005, 4:20:09 AM

On Tue, 2005-02-01 at 20:56, Eric W. Biederman wrote:
> Vivek Goyal <vgo...@in.ibm.com> writes:
>
> > Well, trying to put the already discussed ideas together. I was
> > planning to work on following design. Please comment.
> >
> > Crashed Kernel <-->Capture Kernel(or User Space) Interface:
> > ----------------------------------------------------------
> >
> > The whole idea is that Crash image is represented in ELF Core format.
> > These ELF Headers are prepared by kexec-tools user space and put in one
> > segment. Address of start of image is passed to the capture kernel(or
> > user space) using one command line (eg. crashimage=). Now either kernel
> > space or user space can parse the elf headers and extract required
> > information and export final kernel elf core image.
>
> Sounds sane. We need to make certain there is a checksum of that
> region but putting it in a separate segment should ensure that.
>
> I also think we need to look at the name crashimage= and see if we
> can find something more descriptive. But that is minor. Possibly
> elfcorehdr= We have a little while to think about that one before we
> are stuck.

"elfcorehdr=" also looks good.

>
> > > ebie...@xmission.com wrote:
> > > If we were using an ELF header I would include one PT_NOTE program
> > > header per cpu (Giving each cpu it's own area to mess around in).
> > > And I would use one PT_LOAD segment per possible memory zone.
> > > So in the worst case (current sgi altix) (MAX_NUMNODES=256,
> > > MAX_NR_ZONES=3, MAX_NR_CPUS=1024) 256*3+1024 = 1792 program
> > > headers. At 56 bytes per 64bit program header that is 100352 bytes
> > > or 98KiB. A little bit expensive. A tuned data structure with
> > > 64bit base and size would only consume 1792*16 = 28672 or 28KiB.
> >
> > If I prepare One elf header for each physical contiguous memory area (as
> > obtained from /proc/iomem) instead of per zone, then number of elf
> > headers will come down significantly.
>
> A clarification on terminology we are talking about struct Elf64_Phdr
> here. There is only one Elf header. That seems to be clear farther
> down.
>

Exactly. There shall be one ELF header for the whole image. In
addition there will be one struct Elf64_Phdr per contiguous physical
memory area, one Elf64_Phdr of PT_NOTE type for the notes section and one
Elf64_Phdr for the backup region.

> > I don't have any idea on number of
> > actual physically contiguous regions present per machine, but roughly
> > assuming it to be 1 per node, it will lead to 256 + 1024 = 1280 program
> > headers.At 56 bytes per 64 bit program header this will amount to 70KB.
> >
> > This is worst case estimate and on lower end machines this will require
> > much less a space. On machines as big as 1024 cpus, this should not be a
> > concern, as big machines come with big RAMs.
>
> Agreed. Size is not the primary issue. There is some clear waste
> but that is a secondary concern. Not performing a 1-1 mapping
> to the kernel data structures also seems to be a win, as the concepts
> are noticeably different.
>
> > Eric, do you still think that ELF headers are inappropriate to be passed
> > across interface boundary.
>
> I have serious concerns about the kernel generating the ELF headers
> and only delivering them after the kernel has crashed. Because
> then we run into questions of what information can be trusted. If we
> avoid that issue I am not too concerned.

I hope that all ELF headers, once prepared by kexec-tools, need not change
later (I cannot think of any piece of information which would change
later). These shall be put in a separate segment, and SHA-256 shall take
care of the authenticity of the information after a crash.

>
> > Regarding Backup Region
> > -----------------------
> >
> > - Kexec user space does the reservation for backup region segment.
> > - Purgatory copies the backup data to backup region. (Already
> > implemented)
> > - A separate elf header is prepared to represent backed up memory
> > region. And "offset" field of this program header can contain the actual
> > physical address where backup contents are stored.
>
> I like that. I was thinking a virtual versus physical address
> separation. But using the offset field is much more appropriate,
> and it leaves us the potential of doing something nice like specifying
> the kernels virtual address later on. Looking exclusively at the
> offset field to know which memory addresses to dump sounds good.
> For now we should have virtual==physical==offset except for the
> backup region.

For the notes section program header, virtual = physical = 0 and "offset"
shall point to crash_notes[], so that the notes can be read directly by
the capture kernel (or user space).

Thanks
Vivek

Eric W. Biederman

Feb 2, 2005, 9:40:09 AM

Itsuro Oda <o...@valinux.co.jp> writes:

> Hi,
>
> I can't understand why ELF format is necessary.

The ELF format is not. However, essentially the information an ELF
header provides is. So using an ELF header to convey that information
is a sane choice of data structure.

> I think the only necessary information is "what physical address
> regions are valid to read". This information is necessary for any
> sort of dump tools. (and must get it while the system is normal.)
> The Eric's /proc/cpumem idea sounds nice to me.

Patches welcome.

Eric

Eric W. Biederman

Feb 2, 2005, 9:40:13 AM

Koichi Suzuki <koi...@intellilink.co.jp> writes:

> I meant with kexec and dump hook, there could be many more things can be done in
> addition to full core dump. Initiating failover to other node will be one
> example. Starting with this hook, there must be many good ideas. So my idea
> is to make this hook general purpose, not for specific core dump tool.

Again, that is what has been implemented. A fully stand-alone executable
that lives at an independent and reserved address in memory is jumped
to.

The goal in the generic kernel is to keep the code path to do that
as small and as simple as possible to reduce the chances of it being
mis-implemented, or the chances of attempting to use corrupted kernel
functionality.

Eric

Eric W. Biederman

Feb 2, 2005, 9:50:19 AM

And the feedback begins :)

Itsuro Oda <o...@valinux.co.jp> writes:

> Hi,
>
> I don't like calling crash_kexec() directly in (ex.) panic().
> It should be call_dump_hook() (or something like this).
>
> I think the necessary modifications of the kernel is only:
> - insert the hooks that calls a dump function when crash occur

crash_kexec()

> - binding interface that binds a dump function to the hook
> (like register_dump_hook())

sys_kexec_load(...);

> - supply the information of valid physical address regions

/proc/iomem or possibly /proc/cpumem. At least until someone
actually implements hot plug memory support.

> (- maybe some existent functions and variables need to be exported ?)
>
> I think this makes any sort of dump functions can be implemented
> as a kernel module. I don't think it is best way that the "kexec based
> crashdump" is built in the kernel.

For people developing code outside of the kernel I can see where
this is a problem. Given the insane auditing requirements necessary
to get a reliable code path I don't see how not putting the implementation
in the kernel is sane. Anything that needs to be touched at that point
is core kernel functionality GPL_ONLY if it is exported at all.
Touching anything from a module at that point is not sane.

Basically the code path setup with crash_kexec is little more
than a jump instruction. And it should be audited and reduced
as much as possible. I don't see how you get simpler or what
piece of functionality could possibly improve by having multiple
implementations in kernel modules.

Eric W. Biederman

Feb 2, 2005, 10:30:21 AM

Koichi Suzuki <koi...@intellilink.co.jp> writes:

> Itsuro Oda wrote:
> > Hi,
> > I can't understand why ELF format is necessary.
> > I think the only necessary information is "what physical address regions are
> > valid to read". This information is necessary for any
> > sort of dump tools. (and must get it while the system is normal.)
> > The Eric's /proc/cpumem idea sounds nice to me.
>
> I agree. Format conversion should be done in healthy system separately and we
> should restrict what to do while taking the dump as few as possible. Conversion
> from just memory image to crash/lcrash format will be very useful to use
> existing tools and experiences. I already have such tool and (if my
> administration allows) I can make such tool open. Let me do some paperwork.

The big part of the conversation that is happening right now is how
do we uncouple dependencies between the various parts as much as
possible. There is nothing here about format conversions except
to convert weird kernel formats into a stable interface.

There are 3 pieces of code interacting.
1) The primary kernel that will call panic.
2) The kernel+initrd that takes over.
3) The user space that sets it all up (/sbin/kexec) while the primary
kernel is still in a sane state.

The goal is to make those 3 pieces as independent of each other as
reasonably possible.

So the kernel+initrd that captures a crash dump will live and execute
in a reserved area of memory. It needs to know which memory regions
are valid, and it needs to know small things like the final register
state of each cpu. For the set of valid memory regions it is the
intention to encode that as an array of ELF program headers. The
information of what the final register contents were will be encoded
as ELF notes. There will be one PT_NOTE segment per cpu that holds
the notes needed to encode a given cpu's final state. It really
does not matter to the implementation that captures each cpu's final
register state which format we record the data in, so using a format
designed not to change is not a problem. So all that needs
to be communicated to the kernel+initrd that captures a crash
dump is the location of an ELF header and it can figure out all of
the rest.

The primary kernel, except for remembering its final cpu
register state as it dies, does nothing except jump to the
crash recovery kernel. All of the interesting information will
be exported to user space.

/sbin/kexec is the glue that fills in the cracks. While
the primary kernel is in a sane state it sets everything up including
finding out which memory areas need to be looked at. And it stashes
it all in a reserved area of memory, that has never been the target
of DMA transfers.
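
One way the "which memory areas need to be looked at" step could be done from user space is sketched below in Python: it picks the top-level "System RAM" ranges out of /proc/iomem-style text. The sample text is invented, and the function name is hypothetical. This is only an illustration of the idea, not what /sbin/kexec actually does.

```python
import re

# Top-level /proc/iomem lines look like "00100000-3ffeffff : System RAM".
# Nested entries are indented, so anchoring at the line start skips them.
_RAM_LINE = re.compile(r"^([0-9a-fA-F]+)-([0-9a-fA-F]+) : System RAM$")


def system_ram_regions(iomem_text):
    """Return a (start, size) tuple for every top-level "System RAM" range."""
    regions = []
    for line in iomem_text.splitlines():
        m = _RAM_LINE.match(line)
        if m:
            start, end = int(m.group(1), 16), int(m.group(2), 16)
            regions.append((start, end - start + 1))  # the end address is inclusive
    return regions


# Made-up sample; real input would be the contents of /proc/iomem.
SAMPLE = """\
00000000-0009fbff : System RAM
000a0000-000bffff : Video RAM area
00100000-3ffeffff : System RAM
  00100000-002fffff : Kernel code
3fff0000-3fffffff : ACPI Tables
"""

for start, size in system_ram_regions(SAMPLE):
    print(hex(start), hex(size))
```

Each returned region would then become one PT_LOAD program header in the list handed to the capture kernel.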

The goal is to reduce the dependencies as much as possible. So
an old stable kernel can take a crash dump of a new buggy kernel.
And so that you don't have to be running the latest and greatest
user space simply to set everything up. Although it is still
better to require a user-space upgrade to cope with new
kernels than to require the crash capture kernel+initrd to
be upgraded.

Eric

Eric W. Biederman

Feb 2, 2005, 10:50:14 AM

Vivek Goyal <vgo...@in.ibm.com> writes:

> On Tue, 2005-02-01 at 20:56, Eric W. Biederman wrote:
> > Vivek Goyal <vgo...@in.ibm.com> writes:
>

> "elfcorehdr=" also looks good.

Then let's go with that for now. It is not perfect but it seems
a little more self explanatory at first glance.

> > A clarification on terminology we are talking about struct Elf64_Phdr
> > here. There is only one Elf header. That seems to be clear farther
> > down.
> >
>
>
> Exactly. There shall be one Elf header for whole of the image. In
> addition there will be one struct Elf64_Phdr, per contiguous physical
> memory area. One Elf64_Phdr of PT_NOTE type for notes section and one
> Elf64_Phdr for backup region.

Actually, if we are just pointing at kernel data structures we will
need multiple Elf64_Phdr of PT_NOTE. Each cpu has its own
notes section and until the smoke clears we can't be confident
about what is going to wind up there or how densely those will
be packed. So collapsing everything into a single notes segment
needs to happen after we have switched to the crash capture kernel.

> > I have serious concerns about the kernel generating the ELF headers
> > and only delivering them after the kernel has crashed. Because
> > then we run into questions of what information can be trusted. If we
> > avoid that issue I am not too concerned.
>
>
> I hope, all elf headers once prepared by kexec-tools need not to change
> later (Cannot think of any piece of information which shall change
> later). These shall be put in separate segment. And SHA-256 shall take
> care of authenticity of information after crash.

That should work fine. We need to consider throwing in an
extra note section with information like kernel version that
we can capture while the system is running.

> For notes section program header, virtual = physical = 0 and "offset"
> shall point to crash_notes[], so that notes can directly be read by the
> capture kernel (or user space).

I agree. But see my caveat. I think we should have one PT_NOTE
segment point at each element of the crash_notes[] array. I know
it is technically a violation of the ELF spec. But in this case
it makes sense. Since we can't guarantee that crash_notes will
be packed properly I don't know that we could reliably see more
than one cpu if we pointed a PT_NOTE header at the whole thing.

If it turns out that we can reliably point a single PT_NOTE header
at crash_notes so much the better but things are likely to be
more robust if we don't start with that assumption. That
at least allows us the freedom to capture some notes (like NT_UTSNAME)
before the kernel crashes.
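
A sketch of the one-header-per-element layout, in Python for illustration. It assumes the 1K per cpu figure that is hard-coded for now, the standard Elf64_Phdr layout, and a hypothetical physical address for crash_notes.

```python
import struct

PT_NOTE = 4
# p_type, p_flags, p_offset, p_vaddr, p_paddr, p_filesz, p_memsz, p_align
ELF64_PHDR = struct.Struct("<IIQQQQQQ")
NOTE_BYTES_PER_CPU = 1024  # the 1K per cpu value hard-coded in the thread


def note_phdrs(crash_notes_paddr, nr_cpus, per_cpu=NOTE_BYTES_PER_CPU):
    """One PT_NOTE header per crash_notes[] element, with vaddr = paddr = 0."""
    headers = []
    for cpu in range(nr_cpus):
        # p_offset points straight at this cpu's slot in the array.
        offset = crash_notes_paddr + cpu * per_cpu
        headers.append(ELF64_PHDR.pack(PT_NOTE, 0, offset, 0, 0, per_cpu, per_cpu, 0))
    return headers


# Hypothetical address; a real tool would have to discover it at load time.
headers = note_phdrs(0x456780, nr_cpus=4)
```

Because every slot gets its own header, a reader never has to guess how densely the notes inside one slot are packed before finding the next cpu.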

Eric

Hirokazu Takahashi

Feb 3, 2005, 2:20:05 AM

Hi Vivek and Eric,

IMHO, why don't we swap not only the contents of the top 640K
but also the kernel working memory for the kdump kernel?

I guess this approach has some good points.

1. Preallocating the reserved area at boot time is not mandatory,
   and the reserved area can be distributed in small pieces
   like the original kexec does.

2. Special linking is not required for the kdump kernel.
   Each kdump kernel can be linked in the same way,
   at the location where the original kernel exists.

Am I missing something?

 physical memory
    +-------+
    | 640K  ------------+
    |.......|           |
    |       |         copy
    +-------+           |
    |       |           |
    |original<-----+    |
    |kernel |      |    |
    |       |      |    |
    |.......|      |    |
    |       |      |    |
    |       |      |    |
    |       |    swap   |
    |       |      |    |
    +-------+      |    |
    |reserved<----------+
    |area   |      |
    |       |      |
    |kdump  |<-----+
    |kernel |
    +-------+
    |       |
    |       |
    |       |
    +-------+

Thanks,
Hirokazu Takahashi.

Itsuro Oda

Feb 3, 2005, 2:40:07 AM

Hi,

On 02 Feb 2005 08:24:03 -0700

ebie...@xmission.com (Eric W. Biederman) wrote:
>

> So the kernel+initrd that captures a crash dump will live and execute
> in a reserved area of memory. It needs to know which memory regions
> are valid, and it needs to know small things like the final register
> state of each cpu.

Exactly.

Please let me clarify what you are going to do.
1) standard kernel: reserve a small contiguous area for a dump kernel
(this is unchanged from the current code)
2) standard kernel: export the information of valid physical memory
regions. (/proc/iomem or /proc/cpumem etc.)
3) kexec (system call?): store the information of valid physical memory
regions as ELF program headers in the reserved area (mentioned in 1)).
4) standard kernel: when a panic occurs, append (e.g.) the register
information as ELF notes after the memory information (if necessary),
and jump to the new kernel
5) dump kernel: export all valid physical memory (and saved register
information) to the user. (as /dev/oldmem /proc/vmcore ?)

Is this correct? One question: how does the dump kernel know where
the ELF headers were saved?

One more question: I don't understand what the 640K backup area is.
Please let me know why it is necessary.

Thanks.
--
Itsuro ODA <o...@valinux.co.jp>

Vivek Goyal

Feb 3, 2005, 3:20:08 AM

Hi,

On Thu, 2005-02-03 at 12:32, Hirokazu Takahashi wrote:
> Hi Vivek and Eric,
>
> IMHO, why don't we swap not only the contents of the top 640K
> but also kernel working memory for kdump kernel?

The initial kdump patches had adopted the same approach, but given the
fact that devices are not stopped during the transition to the new kernel
after a panic, it carried the inherent risk of some DMA going on and
corrupting the new kernel/data structures. Hence the idea of running the
kernel from a reserved location came up. This should be DMA safe as long
as DMA is not misdirected.

Thanks
Vivek

Eric W. Biederman

Feb 3, 2005, 4:10:07 AM

Itsuro Oda <o...@valinux.co.jp> writes:

> Hi,
>
> On 02 Feb 2005 08:24:03 -0700
> ebie...@xmission.com (Eric W. Biederman) wrote:
> >
> > So the kernel+initrd that captures a crash dump will live and execute
> > in a reserved area of memory. It needs to know which memory regions
> > are valid, and it needs to know small things like the final register
> > state of each cpu.
>
> Exactly.
>
> Please let me clarify what you are going to.
> 1) standard kernel: reserve a small contiguous area for a dump kernel
> (this is not changed as the current code)
> 2) standard kernel: export the information of valid physical memory
> regions. (/proc/iomem or /proc/cpumem etc.)
> 3) kexec (system call?): store the information of valid physical memory
> regions as ELF program header to the reserved area (mentioned 1)).

A better description is probably: make a list of memory regions
using an ELF header data structure in user space.
Use sys_kexec_load to put that list, the dump kernel and a little
bit of glue code in the reserved area. The glue code includes
a hash of everything so it can all be validated before
use.
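
The validate-before-use idea can be sketched as follows, in Python for illustration. SHA-256 is the hash mentioned earlier in the thread. How the glue code actually stores the digest and walks the segments is not shown here, so the function names and the segment list are assumptions.

```python
import hashlib


def digest_segments(segments):
    """SHA-256 over the concatenated bytes of all loaded segments."""
    h = hashlib.sha256()
    for data in segments:
        h.update(data)
    return h.digest()


def segments_intact(segments, expected_digest):
    """True if the segments still hash to the digest recorded at load time."""
    return digest_segments(segments) == expected_digest


# At load time: record the digest next to the segments.
loaded = [b"elf headers", b"dump kernel image"]
recorded = digest_segments(loaded)

# After the crash: refuse to continue if anything was overwritten.
print(segments_intact(loaded, recorded))
```

If a stray DMA transfer or a wild pointer has scribbled on any segment, the digest no longer matches and the capture can be abandoned rather than running corrupted code.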

> 4) standard kernel: when a panic occur, append (ex.) the register
> information as ELF note after the memory information (if necessary).
> and jump new kernel

Record the register information as ELF notes in a per cpu data
area. The per cpu data areas are known and enumerated in
the list of memory regions. The kernel knows nothing about
the ELF header etc.
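
For reference, an ELF note is a small self-describing record: a header of three 32-bit words (name size, descriptor size, type) followed by the name and the descriptor. The Python sketch below packs one. The 4-byte padding is an assumption matching what Linux core files use, and the descriptor bytes are placeholders rather than a real register dump.

```python
import struct

NT_PRSTATUS = 1  # note type conventionally used for register state in core files


def pad4(data):
    """Pad to a 4-byte boundary (assumed alignment for notes)."""
    return data + b"\0" * (-len(data) % 4)


def elf_note(name, note_type, desc):
    """Pack one ELF note: header (namesz, descsz, type), then name and desc."""
    name_bytes = name.encode("ascii") + b"\0"  # namesz counts the trailing NUL
    header = struct.pack("<III", len(name_bytes), len(desc), note_type)
    return header + pad4(name_bytes) + pad4(desc)


# Placeholder descriptor; a real note would carry the saved cpu registers.
note = elf_note("CORE", NT_PRSTATUS, b"\x01\x02\x03")
```

Since each note carries its own sizes, the crashing kernel can write them into its per cpu area without knowing anything about the ELF header that points there.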

> 5) dump kernel: export all valid physical memory (and saved register
> information) to the user. (as /dev/oldmem /proc/vmcore ?)

Or in user space, by just mmapping /dev/mem. That is part of the
current conversation. The only real point for putting that code in
the kernel (besides momentum) is it is a cheap way to get the exact
data structures of the kernel you are using. But since:
(a) it does not look like any primary kernel data structures need to
be examined.
(b) even simple compile options like SMP/NOSMP are enough to change
the layout of the data structures.
I think there is a pretty good case for moving all of the work to
user space. But you still need a kernel that loads and
runs in the reserved area.

> Is this correct ? one question: how the dump kernel know the saved
> area of ELF headers ?

A command line parameter will be passed. Probably
elfcorehdr=xxx
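
On the receiving side, picking the parameter up could look like the Python sketch below. The thread leaves the value format open ("xxx"), so treating it as a plain hex or decimal address is an assumption, as is the function name.

```python
def elfcorehdr_address(cmdline):
    """Return the address given by elfcorehdr= on a command line, or None."""
    prefix = "elfcorehdr="
    for word in cmdline.split():
        if word.startswith(prefix):
            # Base 0 accepts both "0x..." hex and plain decimal.
            return int(word[len(prefix):], 0)
    return None


# Made-up sample; user space would read the real one from /proc/cmdline.
print(elfcorehdr_address("root=/dev/sda1 elfcorehdr=0x2000000 ro"))
```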

> one more question: I don't understand what the 640K backup area is.
> Please let me know why it is necessary.

In practice I think we can kill it on x86. It is necessary (at least
a subset of it is) if we want to boot an SMP kernel, as a cpu must
start running code in the first 1M of the address space. In addition
some architectures have exception vectors and/or other data
structures at fixed locations in memory, so in the general case a
backup area is required. So building the infrastructure to handle
backup areas is needed, even if we later stop using it on
x86.

The other reason for the 640K backup area is the IBM guys were having
problems without it. The fact that you don't need it is a good
indication that it is unnecessary.

Eric

Eric W. Biederman

Feb 3, 2005, 4:30:15 AM

Hirokazu Takahashi <ta...@valinux.co.jp> writes:

> Hi Vivek and Eric,
>
> IMHO, why don't we swap not only the contents of the top 640K
> but also kernel working memory for kdump kernel?
>
> I guess this approach has some good points.
>
> 1.Preallocating reserved area is not mandatory at boot time.
> And the reserved area can be distributed in small pieces
> like original kexec does.
>
> 2.Special linking is not required for kdump kernel.
> Each kdump kernel can be linked in the same way,
> where the original kernel exists.
>
> Am I missing something?

Preallocating the reserved area is largely to keep it from
being the target of DMA accesses. Since we are not able
to shut down any of the drivers in the primary kernel, running
in a normal swath of memory sounds like a good way to get
yourself stomped at the worst possible time.

In addition we get to avoid running a lot of code in the
panic path if we are jumping to a contiguous region of memory
with everything already set up.

To some extent this is a contest over who has the better imagination
for things that can go wrong: real life on dying hardware and
kernels, or the programmers writing the diagnostic code.

But if it is a gamble you are willing to take, it is quite
feasible to use the reserved region for what you are
proposing, and you could run a standard kernel.

The other reason for running out of the reserved region is that
it actually requires less memory reserved. Every byte you back up
needs to have a reserved area of memory to hold it. And if you are
also going to fill that with meaningful content you need another
byte to hold the data. So using a stock kernel probably requires
2/3 more memory.

Eric

Hirokazu Takahashi

Feb 3, 2005, 4:50:09 AM

Hi Vivek,

> > Hi Vivek and Eric,
> >
> > IMHO, why don't we swap not only the contents of the top 640K
> > but also kernel working memory for kdump kernel?
>
>
> Initial patches of kdump had adopted the same approach but given the
> fact devices are not stopped during transition to new kernel after a
> panic, it carried inherent risk of some DMA going on and corrupting the
> new kernel/data structures. Hence the idea of running the kernel from a
> reserved location came up. This should be DMA safe as long as DMA is not
> misdirected.

I see, that makes sense.
But I'm not sure yet that it's safe to access the top of 640MB.
I wonder how kmalloc(GFP_DMA) works in a kdump kernel.

Thanks,
Hirokazu Takahashi.

Eric W. Biederman

Feb 3, 2005, 5:20:10 AM

Hirokazu Takahashi <ta...@valinux.co.jp> writes:

> Hi Vivek,
>
> > > Hi Vivek and Eric,
> > >
> > > IMHO, why don't we swap not only the contents of the top 640K
> > > but also kernel working memory for kdump kernel?
> >
> >
> > Initial patches of kdump had adopted the same approach but given the
> > fact devices are not stopped during transition to new kernel after a
> > panic, it carried inherent risk of some DMA going on and corrupting the
> > new kernel/data structures. Hence the idea of running the kernel from a
> > reserved location came up. This should be DMA safe as long as DMA is not
> > misdirected.
>
> I see, that makes sense.
> But I'm not sure yet that it's safe to access the top of 640MB.

640K?

> I wonder how kmalloc(GFP_DMA) works in a kdump kernel.

All that happens there is a one line change to vmlinux.lds.S that
causes the kernel to live at a different physical and virtual
address. So everything works as normal.

I do agree that it is risky to use the first 640K for normal work.
But on the list of things to fix it is a minor wart, and even if we
back up that region of memory we don't need to use it.

There still remain a lot of code reviews to do to ensure the code is
generally safe.

Eric

Hirokazu Takahashi

Feb 3, 2005, 5:30:16 AM

Hi Eric,

> > Hi Vivek and Eric,
> >
> > IMHO, why don't we swap not only the contents of the top 640K
> > but also kernel working memory for kdump kernel?
> >
> > I guess this approach has some good points.
> >
> > 1.Preallocating reserved area is not mandatory at boot time.
> > And the reserved area can be distributed in small pieces
> > like original kexec does.
> >
> > 2.Special linking is not required for kdump kernel.
> > Each kdump kernel can be linked in the same way,
> > where the original kernel exists.
> >
> > Am I missing something?
>
> Preallocating the reserved area is largely to keep it from
> being the target of DMA accesses. Since we are not able
> to shutdown any of the drivers in the primary kernel running
> in a normal swath of memory sounds like a good way to get
> yourself stomped at the worst possible time.

So what do you think of my other idea?

I think we can always map the kdump kernel at the same virtual
address. Then we would not have to care about the physical address
where the kdump kernel is loaded.

I believe the memsection functionality that the LHMS project is working
on would help with this.

                           +
                           |
                           |
                     (user space)
                           |
                           |
         physical          | virtual
         memory            | space
            + ------------ +
            |              |
            |              |
            |              |
            + ------------.+
original    |           .  | map kdump kernel here
kernel      |         .    |
            |       .      |
            |     .       .+
            +   .       .  |
            | .       .    |
            +       .      |
kdump       |     .        |
kernel      |   .          |
            | .            |
            +              |
            |              |
            |              |
            |              |

Thanks,
Hirokazu Takahashi.

Eric W. Biederman

Feb 3, 2005, 6:20:09 AM

Hirokazu Takahashi <ta...@valinux.co.jp> writes:

> Hi Eric,
>
> > > Hi Vivek and Eric,
> > >
> > > IMHO, why don't we swap not only the contents of the top 640K
> > > but also kernel working memory for kdump kernel?
> > >
> > > I guess this approach has some good points.
> > >
> > > 1.Preallocating reserved area is not mandatory at boot time.
> > > And the reserved area can be distributed in small pieces
> > > like original kexec does.
> > >
> > > 2.Special linking is not required for kdump kernel.
> > > Each kdump kernel can be linked in the same way,
> > > where the original kernel exists.
> > >
> > > Am I missing something?
> >
> > Preallocating the reserved area is largely to keep it from
> > being the target of DMA accesses. Since we are not able
> > to shutdown any of the drivers in the primary kernel running
> > in a normal swath of memory sounds like a good way to get
> > yourself stomped at the worst possible time.
>
> So what do you think my another idea?

I have proposed it. I think ia64 already does that.
It has been pointed out that the PowerPC kernel occasionally runs
with the mmu turned off. So it is not a technique that is 100%
portable.

> I think we can always make a kdump kernel mapped to the same virtual
> address. So we will be free from caring about the physical address
> where the kdump kernel is loaded.
>
> I believe the memsection functionality which LHMS project is working
> on would help this.

You don't need anything fancy except to build the page tables
during bootup. However there are a few potential gotchas
with respect to using large pages, which can give 4MiB or
greater alignment restrictions on the kernel. Code-wise
the gotcha is moving the kernel's .text section into what
is essentially the vmalloc portion of the address space.
For x86_64 the kernel's virtual address is already decoupled from the
physical addresses, so it is probably easier.

Most of this just results in easier management between the pieces,
which is a good thing. However at the moment I don't think it
simplifies any of the core problems. I still need to reserve
a large hunk of physical address space early on, before any
DMA transactions are set up, to hold the new kernel.

So while I am happy to see patches that improve this I don't
actually care right now.

Eric

Vivek Goyal

Feb 3, 2005, 9:10:06 AM

On Wed, 2005-02-02 at 21:12, Eric W. Biederman wrote:
> Vivek Goyal <vgo...@in.ibm.com> writes:
>
> > On Tue, 2005-02-01 at 20:56, Eric W. Biederman wrote:
> > > Vivek Goyal <vgo...@in.ibm.com> writes:
> >
> > "elfcorehdr=" also looks good.
>
> Then let's go with that for now. It is not perfect but it seems
> a little more self explanatory at first glance.
> > > A clarification on terminology we are talking about struct Elf64_Phdr
> > > here. There is only one Elf header. That seems to be clear farther
> > > down.
> > >
> >
> >
> > Exactly. There shall be one Elf header for whole of the image. In
> > addition there will be one struct Elf64_Phdr, per contiguous physical
> > memory area. One Elf64_Phdr of PT_NOTE type for notes section and one
> > Elf64_Phdr for backup region.
>
> Actually if we are just pointing a kernel data structures we will
> need multiple Elf64_Phdr of PT_NOTE. Each cpu has it's own
> notes section and until the smoke clears we can't be confident
> about what is going to wind up there or how densely those will
> be packed. So collapsing everything into a single notes segment
> needs to happen after we have switched to the crash capture kernel.

Sounds good. So there shall be a PT_NOTE type program header per cpu.
And these headers can be collapsed into one PT_NOTE type header later.
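
A minimal sketch of that later collapsing step, as it might be done in
user space in the capture environment, follows. The names
(struct note_seg, merge_note_segs) are hypothetical. It assumes each per
cpu area holds zero or more complete notes whose total length is already
a multiple of 4, so that plain concatenation yields one valid PT_NOTE
segment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One per cpu note area as found after the crash (hypothetical type). */
struct note_seg {
	const unsigned char *data;
	size_t len;	/* bytes of valid note data, 0 if the cpu saved none */
};

/*
 * Concatenate all non-empty note areas into out.
 * Returns 0 and stores the merged length, or -1 if out is too small.
 */
static int merge_note_segs(const struct note_seg *segs, size_t nsegs,
			   unsigned char *out, size_t outlen,
			   size_t *merged_len)
{
	size_t used = 0;
	size_t i;

	assert(segs != NULL && out != NULL && merged_len != NULL);

	for (i = 0; i < nsegs; i++) {
		if (segs[i].len == 0)
			continue;		/* cpu never reached the save code */
		assert(segs[i].len % 4 == 0);	/* notes are 4-byte padded */
		if (segs[i].len > outlen - used)
			return -1;
		memcpy(out + used, segs[i].data, segs[i].len);
		used += segs[i].len;
	}
	*merged_len = used;
	return 0;
}
```

The merged buffer would then be described by a single PT_NOTE program
header in the core file that finally gets written out.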

Itsuro Oda

Feb 3, 2005, 6:30:43 PM

Hi,

On 03 Feb 2005 02:00:51 -0700

ebie...@xmission.com (Eric W. Biederman) wrote:

> A better description is probably make a list of memory regions
> using an ELF header data structure in user space.
> Use sys_kexec_load to put that list the dump kernel and a little
> big of glue code in the reserved area. The glue code includes
> a hash of all of everything so it can all be validated before
> use.

I see. So the data structure is loaded as part of the dump kernel's data.

> Record the register information as ELF notes in a per cpu data
> area. The per cpu data areas are known and enumerated in
> the list of memory regions. The kernel knows nothing about
> the ELF header etc.
>

I see.

> > 5) dump kernel: export all valid physical memory (and saved register
> > information) to the user. (as /dev/oldmem /proc/vmcore ?)
>
> Or in user space, by just mmaping /dev/mem. That is part of the
> current conversation. The only real point for putting that code in
> the kernel (besides momentum) is it is a cheap way to get the exact
> data structures of the kernel you are using. But since:
> (a) it does not look like any primary kernel data structures need to
> be examined.
> (b) even simple compile options like SMP/NOSMP are enough to change
> the layout of the data structures.
> I think there is a pretty good case for moving all of the work to
> user space. But you still need a kernel that loads and
> runs in the reserved area.
>

I don't make sense. what do you mean ?

What we want to do when the system has crashed is store the whole
physical memory (and the saved register information for the x86 arch) in
some place (e.g. a disk partition) for later analysis.
So the basic requirements for the dump kernel are:
* supply a method to access the whole (valid) physical memory.
* supply a method to access the saved register information.

Does kdump meet these requirements?

(I am not interested in /proc/vmcore. Constructing the vmcore
image is the job of analysis tools, not the kernel.)

Thanks.
--
Itsuro ODA <o...@valinux.co.jp>

Itsuro Oda

Feb 3, 2005, 7:30:12 PM

Hi,

On 02 Feb 2005 07:45:11 -0700

ebie...@xmission.com (Eric W. Biederman) wrote:

>
> And the feedback begins :)
>
> Itsuro Oda <o...@valinux.co.jp> writes:
>
> > Hi,
> >
> > I don't like calling crash_kexec() directly in (ex.) panic().
> > It should be call_dump_hook() (or something like this).
> >
> > I think the necessary modifications of the kernel is only:
> > - insert the hooks that calls a dump function when crash occur
> crash_kexec()
> > - binding interface that binds a dump function to the hook
> > (like register_dump_hook())
> sys_kexec_load(...);

For example, there are people who want to execute a built-in kernel
debugger when the system has crashed. Or there are people who
believe diskdump is the best dump tool :-)

So I think some sort of hook is better than calling crash_kexec
directly. (May I make a patch?)

Thanks.
--
Itsuro ODA <o...@valinux.co.jp>

Eric W. Biederman

Feb 3, 2005, 8:00:24 PM

Itsuro Oda <o...@valinux.co.jp> writes:

Yes, the discussion in this area is about the best way to implement
this requirement: how much should be in the kernel and how much
should be in user space.

At the moment things are broken but should be fixed shortly.
What has been implemented so far is /dev/oldmem, which provides access
to the old memory, and /proc/vmcore, which provides both the old
memory and the register information.

> (I am not interesting to /proc/vmcore. Constructing the vmcore
> image is area of analysis tools. not kernel's task.)

There is a fine line there, as a simple ELF core dump has just enough
information to describe discontiguous memory, and to have an out of
band channel for register information. Adding anything extra like
virtual addresses that match the kernel should be left for the
crash dump analysis tools.

In code that is currently in the mainstream kernel, /dev/mem can
mmap any area of memory that is not used by the kernel as ram.
So what I believe we will end up with is that /sbin/kexec (user space)
will prepare an ELF header (data) that describes the memory regions
and details where to find the kernel's register information. The
address of that ELF header will be passed to the crash dump
capture kernel and user space combination. Then something
(probably a user space program reading /dev/mem) will look
at the ELF header and save the already prepared ELF core
dump to disk, possibly doing little things like merging
the MAX_NR_CPUS note segments into one so it actually conforms
to the ELF spec.
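
To show roughly what "an ELF header that describes the memory regions"
could look like, here is a user-space sketch that emits an Elf64_Ehdr
followed by one PT_LOAD program header per region. The names
(struct mem_region, build_core_headers) are hypothetical, EM_X86_64 is
just an example machine type, and p_offset/p_vaddr are left at zero for
whatever finally writes the dump to fill in. This is not the /sbin/kexec
implementation.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One contiguous range of physical ram (hypothetical type). */
struct mem_region {
	uint64_t start;
	uint64_t size;
};

/*
 * Write an ELF core header plus one PT_LOAD entry per region into buf.
 * Returns the number of bytes written, or 0 if buf is too small.
 */
static size_t build_core_headers(unsigned char *buf, size_t buflen,
				 const struct mem_region *regions,
				 size_t nregions)
{
	Elf64_Ehdr ehdr;
	Elf64_Phdr phdr;
	size_t need, i;

	assert(buf != NULL && regions != NULL);

	if (nregions >= 0xffff)		/* e_phnum is only 16 bits wide */
		return 0;
	need = sizeof(ehdr) + nregions * sizeof(phdr);
	if (need > buflen)
		return 0;

	memset(&ehdr, 0, sizeof(ehdr));
	memcpy(ehdr.e_ident, ELFMAG, SELFMAG);
	ehdr.e_ident[EI_CLASS] = ELFCLASS64;
	ehdr.e_ident[EI_DATA] = ELFDATA2LSB;
	ehdr.e_ident[EI_VERSION] = EV_CURRENT;
	ehdr.e_type = ET_CORE;
	ehdr.e_machine = EM_X86_64;	/* example value only */
	ehdr.e_version = EV_CURRENT;
	ehdr.e_phoff = sizeof(ehdr);
	ehdr.e_ehsize = sizeof(ehdr);
	ehdr.e_phentsize = sizeof(phdr);
	ehdr.e_phnum = (Elf64_Half)nregions;
	memcpy(buf, &ehdr, sizeof(ehdr));

	for (i = 0; i < nregions; i++) {
		memset(&phdr, 0, sizeof(phdr));
		phdr.p_type = PT_LOAD;
		phdr.p_flags = PF_R | PF_W | PF_X;
		phdr.p_paddr = regions[i].start;	/* physical address */
		phdr.p_filesz = regions[i].size;
		phdr.p_memsz = regions[i].size;
		memcpy(buf + sizeof(ehdr) + i * sizeof(phdr), &phdr,
		       sizeof(phdr));
	}
	return need;
}
```

The per cpu note areas would be described the same way, with PT_NOTE
entries instead of PT_LOAD.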

This thread started as the design discussion before finishing
that part of the implementation. The proof of concept
implementations have happened. We have all seen this kind
of functionality implemented. Now is the time to come up
with a good solid design that can be maintained and merged
into the mainline kernel and distros.

So thank you for asking questions; it means we have a better chance
of getting a solid design and a design that those people who
care about this functionality can use. And with a little luck
we can all wind up agreeing on the general principles. You came in
a little late to this conversation so a lot of details have been
settled, but if you have a good argument for doing something another
way we can certainly look at that.

Eric

Eric W. Biederman

Feb 3, 2005, 9:20:09 PM

Itsuro Oda <o...@valinux.co.jp> writes:

> Hi,
>
> On 02 Feb 2005 07:45:11 -0700
> ebie...@xmission.com (Eric W. Biederman) wrote:
>
> >
> > And the feedback begins :)
> >
> > Itsuro Oda <o...@valinux.co.jp> writes:
> >
> > > Hi,
> > >
> > > I don't like calling crash_kexec() directly in (ex.) panic().
> > > It should be call_dump_hook() (or something like this).
> > >
> > > I think the necessary modifications of the kernel is only:
> > > - insert the hooks that calls a dump function when crash occur
> > crash_kexec()
> > > - binding interface that binds a dump function to the hook
> > > (like register_dump_hook())
> > sys_kexec_load(...);
>
> For example there are pepole who want to execute a built in kernel
> debugger when the system is crashed. or there are pepole who
> believe the diskdump is the best dump tool :-)
>
> So I think a sort of hook is better than calling crash_kexec
> directly. (May I make a patch ?)

The prevalent feeling I have heard from kernel developers, and
my personal feeling as well, is that after a kernel has called
panic you can't trust it. Which means anything running in the kernel
itself is suspect.

The crash_kexec() hook enables everything that does not get linked into
the kernel. So I don't feel a hook in the panic path is necessary
nor do I feel that it is wise, especially with no in-kernel users.

Plus the worst part about a hook in the panic path is that it is
inherently racy. Keeping the crash_kexec() code from blocking or
being racy has been a challenge. And I still think that entire code
path needs a review and some more code tweaks to remove races.
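
As a rough user-space illustration of the "never block, let exactly one
caller through" property wanted here, the following sketch uses a C11
atomic flag. The names (crash_in_progress, crash_path_try_enter,
crash_path_sketch, jumps_taken) are hypothetical and this is not the
kernel's actual locking; it only shows the shape of a non-blocking guard
for a path that several cpus may hit at once.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Set by the first caller to enter the crash path; never cleared. */
static atomic_flag crash_in_progress = ATOMIC_FLAG_INIT;

/* Stands in for the jump to the capture kernel in this sketch. */
static int jumps_taken;

/* Returns true for exactly one caller and never blocks. */
static bool crash_path_try_enter(void)
{
	return !atomic_flag_test_and_set(&crash_in_progress);
}

static void crash_path_sketch(void)
{
	if (!crash_path_try_enter())
		return;		/* someone else already owns the crash path */

	assert(jumps_taken == 0);	/* the jump may happen at most once */
	jumps_taken++;
}
```

Any additional hook called from the same place would need the same care,
which is part of why extra hooks in that path are unattractive.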

If someone else wants a hook in the panic path they can add their own
hook, and make their own case for why it is needed.

Eric

Itsuro Oda

Feb 3, 2005, 9:40:07 PM

Hi,

On Fri, 04 Feb 2005 08:18:56 +0900
Itsuro Oda <o...@valinux.co.jp> wrote:

>
> > > 5) dump kernel: export all valid physical memory (and saved register
> > > information) to the user. (as /dev/oldmem /proc/vmcore ?)
> >
> > Or in user space, by just mmaping /dev/mem. That is part of the
> > current conversation. The only real point for putting that code in
> > the kernel (besides momentum) is it is a cheap way to get the exact
> > data structures of the kernel you are using. But since:
> > (a) it does not look like any primary kernel data structures need to
> > be examined.
> > (b) even simple compile options like SMP/NOSMP are enough to change
> > the layout of the data structures.
> > I think there is a pretty good case for moving all of the work to
> > user space. But you still need a kernel that loads and
> > runs in the reserved area.
> >
> I don't make sense. what do you mean ?
>

"I don't make sense." should be "It does not make sense."
sorry. I'm not familiar with English.

Hirokazu Takahashi

Feb 4, 2005, 5:20:13 AM

Hi,

> > Hi Eric,
> >
> > > > Hi Vivek and Eric,
> > > >
> > > > IMHO, why don't we swap not only the contents of the top 640K
> > > > but also kernel working memory for kdump kernel?
> > > >
> > > > I guess this approach has some good points.
> > > >
> > > > 1.Preallocating reserved area is not mandatory at boot time.
> > > > And the reserved area can be distributed in small pieces
> > > > like original kexec does.
> > > >
> > > > 2.Special linking is not required for kdump kernel.
> > > > Each kdump kernel can be linked in the same way,
> > > > where the original kernel exists.
> > > >
> > > > Am I missing something?
> > >
> > > Preallocating the reserved area is largely to keep it from
> > > being the target of DMA accesses. Since we are not able
> > > to shutdown any of the drivers in the primary kernel running
> > > in a normal swath of memory sounds like a good way to get
> > > yourself stomped at the worst possible time.
> >
> > So what do you think my another idea?
>
> I have proposed it. I think ia64 already does that.
> It has been pointed that the PowerPC kernel occasionally runs
> with the mmu turned off. So it is not a technique the is 100%
> portable.

I see you have.
And MIPS CPUs don't allow kernel pages to be remapped either.

> > I think we can always make a kdump kernel mapped to the same virtual
> > address. So we will be free from caring about the physical address
> > where the kdump kernel is loaded.
> >
> > I believe the memsection functionality which LHMS project is working
> > on would help this.
>
> You don't need anything fancy except to build the page tables
> during bootup. However there are a few potential gotchas
> with respect to using large pages, that can give 4MiB or
> greater alignment restrictions on the kernel. Code wise
> the gotcha is moving the kernel's .text section into what
> is essentially the vmalloc portion of the address space.
> For x86_64 the kernels virtual address is already decoupled from the
> physical addresses, so it is probably easier.

I know we can place the kernel at any address, though there
are some exceptions.

I know mapping kernel pages to the same virtual address only helps
to avoid caring about physical addresses or vmalloc'ed addresses
when linking the kernel. I think it wouldn't be a bad idea on many
architectures. I prefer it to linking the kernel for each
system.

> Most of this just results in easier management between the pieces.
> Which is a good thing. However at the moment I don't think it
> simplifies any of the core problems. I still need to reserve
> a large hunk of physical address space early on before any
> DMA transactions are setup to hold the new kernel.

I agree that my idea is not essential at the moment.

> So while I am happy to see patches that improve this I don't
> actually care right now.

ok.

> Eric
>

Thanks,
Hirokazu Takahashi.

Eric W. Biederman

Feb 4, 2005, 6:30:19 AM

Hirokazu Takahashi <ta...@valinux.co.jp> writes:

> Hi,
>
> > > Hi Eric,
> > >

> I see you have.
> And MIPS CPUs doesn't allow kernel pages to be remapped either.

I guess I should add that being relocatable in the general case most
likely requires running a PIC dynamic linker at kernel startup.
If none of the rest of the kernel is built PIC and the relocation
table is not too big we might be able to convince people to implement
it generally.

At least that is one technique for generating a PIC kernel that I
have not explored fully.

> > You don't need anything fancy except to build the page tables
> > during bootup. However there are a few potential gotchas
> > with respect to using large pages, that can give 4MiB or
> > greater alignment restrictions on the kernel. Code wise
> > the gotcha is moving the kernel's .text section into what
> > is essentially the vmalloc portion of the address space.
> > For x86_64 the kernels virtual address is already decoupled from the
> > physical addresses, so it is probably easier.
>
> I know we can place the kernel in any address though there
> exist some exceptions.
>
> I know mapping kernel pages to the same virtual address only helps
> to avoid caring about physical addresses or vmalloc'ed addresses
> when linking the kernel. I think it wouldn't be bad idea in many
> architectures. I prefer it rather than linking the kernel for each
> system.

Agreed. Although I suspect most architectures will have a region
that will work for most users.

> > Most of this just results in easier management between the pieces.
> > Which is a good thing. However at the moment I don't think it
> > simplifies any of the core problems. I still need to reserve
> > a large hunk of physical address space early on before any
> > DMA transactions are setup to hold the new kernel.
>
> I agree that my idea is not essential at the moment.
>
> > So while I am happy to see patches that improve this I don't
> > actually care right now.
>
> ok.

The one thing I do request is that if you build such a kernel,
you figure out a way to make its ELF header of type ET_DYN, so it
does not require a magic loader to load it.

I have recently patched both etherboot and /sbin/kexec to accept
that kind of binary :)

Eric

Eric W. Biederman

Feb 4, 2005, 7:10:13 AM

ebie...@xmission.com (Eric W. Biederman) writes:

> Hirokazu Takahashi <ta...@valinux.co.jp> writes:
> > > Most of this just results in easier management between the pieces.
> > > Which is a good thing. However at the moment I don't think it
> > > simplifies any of the core problems. I still need to reserve
> > > a large hunk of physical address space early on before any
> > > DMA transactions are setup to hold the new kernel.
> >
> > I agree that my idea is not essential at the moment.
> >
> > > So while I am happy to see patches that improve this I don't
> > > actually care right now.
> >
> > ok.

Thinking about this some more, this does have a significant effect
on the design. For architectures that support this, on the
primary kernel the command line option becomes:
crashkernel=size instead of crashkernel=size@location.
Which means the kernel needs to call alloc_bootmem instead
of reserve_bootmem. So it results in a primary kernel implementation
difference.
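
A user-space sketch of parsing the two forms of the option follows, to
make the difference concrete. The names (parse_mem, struct crashkernel,
parse_crashkernel) are hypothetical and the K/M/G suffix handling is an
assumption modelled on common kernel option syntax. When has_base is set
the primary kernel would reserve exactly that range (reserve_bootmem);
otherwise it would be free to pick any suitable range (alloc_bootmem).

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Parse a number with an optional K, M or G suffix. */
static unsigned long long parse_mem(const char *s, const char **end)
{
	char *e;
	unsigned long long v = strtoull(s, &e, 0);

	switch (*e) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		e++;
		break;
	default:
		break;
	}
	*end = e;
	return v;
}

struct crashkernel {
	unsigned long long size;
	unsigned long long base;
	int has_base;		/* 1 for size@location, 0 for size only */
};

/* arg is the text after "crashkernel="; returns 0 or -1. */
static int parse_crashkernel(const char *arg, struct crashkernel *ck)
{
	const char *p;

	assert(arg != NULL && ck != NULL);

	if (!isdigit((unsigned char)*arg))
		return -1;
	ck->size = parse_mem(arg, &p);
	ck->base = 0;
	ck->has_base = 0;
	if (*p == '@') {
		if (!isdigit((unsigned char)p[1]))
			return -1;
		ck->base = parse_mem(p + 1, &p);
		ck->has_base = 1;
	}
	if (*p != '\0' && *p != ' ')
		return -1;
	return ck->size ? 0 : -1;
}
```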

In addition if we really can push all of the dump specific
functionality into user space, as it appears we can, this allows a
generic kernel to be used for the crash dump process. It will
probably still be a special hardened build where reliability is
more important than performance, so any micro hit we take in
performance by modifying __pa() and __va() will be irrelevant.
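
To illustrate where such a micro hit would come from, here is a toy
model of address translation with a runtime physical base instead of a
compile-time constant. Everything in it is hypothetical (the names, the
0xC0000000 offset, the 4MiB alignment check); it is one possible scheme,
not the actual __pa()/__va() change. The extra cost is the load of
kernel_phys_base on every translation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical fixed virtual base, as with a 3G/1G split on i386. */
#define SKETCH_PAGE_OFFSET 0xC0000000u
#define SKETCH_LARGE_PAGE  0x00400000u	/* 4MiB large page */

/* Physical address the kernel was loaded at; 0 for a normal boot. */
static uint32_t kernel_phys_base;

static void set_phys_base(uint32_t base)
{
	/* Large page mappings need the load address to be 4MiB aligned. */
	assert((base & (SKETCH_LARGE_PAGE - 1)) == 0);
	kernel_phys_base = base;
}

/* Virtual to physical: one extra variable load compared to a constant. */
static inline uint32_t sketch_pa(uint32_t vaddr)
{
	return vaddr - SKETCH_PAGE_OFFSET + kernel_phys_base;
}

/* Physical to virtual, the inverse of sketch_pa(). */
static inline uint32_t sketch_va(uint32_t paddr)
{
	return paddr - kernel_phys_base + SKETCH_PAGE_OFFSET;
}
```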

I like it.

I have already demonstrated that there is a general technique that
any architecture can use to build a kernel that runs at a non-default
address. So for the architectures that cannot build a PIC kernel
there is still a proven solution available, it simply will not
be as nice to manage.

x86_64 should be pretty straightforward. i386 will be a little more
difficult but doable.

Patches are still welcome.

Itsuro Oda

Feb 16, 2005, 4:00:20 AM

Hi, Eric and all

Attached is an implementation of /proc/cpumem.
/proc/cpumem shows the valid physical memory ranges.

* i386 and x86_64
* implement valid_phys_addr_range() and use it.
(the first argument of the i386 version is a little awkward.)
* /dev/mem of the i386 version should be modified, but is not yet.

example: amd64 8GB Mem
# cat /proc/cpumem
0000000000000000 000000000009b800
0000000000100000 00000000fbe70000
0000000100000000 0000000100000000
#
Each line gives the start address and the size, in hex.
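
For anyone who wants to consume this from user space, a minimal sketch
of parsing lines in the format above follows. The names
(struct mem_range, parse_cpumem_line, total_ram_bytes) are hypothetical;
in practice the lines would be read from /proc/cpumem with fgets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct mem_range {
	unsigned long long start;
	unsigned long long size;
};

/* Parse one "start size" line of hex digits; returns 0 or -1. */
static int parse_cpumem_line(const char *line, struct mem_range *r)
{
	assert(line != NULL && r != NULL);
	return sscanf(line, "%llx %llx", &r->start, &r->size) == 2 ? 0 : -1;
}

/* Sum the sizes of all well-formed lines; malformed lines are skipped. */
static unsigned long long total_ram_bytes(const char *const *lines, size_t n)
{
	unsigned long long total = 0;
	struct mem_range r;
	size_t i;

	assert(lines != NULL);
	for (i = 0; i < n; i++)
		if (parse_cpumem_line(lines[i], &r) == 0)
			total += r.size;
	return total;
}
```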

Any comments, recommendations and suggestions are welcome.

BTW, does kexec/kdump not run on 2.6.11-rc3-mm2?
How do I get and examine the latest kexec/kdump?

Thanks.
--
Itsuro ODA <o...@valinux.co.jp>

---
--- linux-2.6.11-rc3-mm2/drivers/char/mem.c 2005-02-16 15:36:31.000000000 +0900
+++ linux-2.6.11-rc3-mm2-test/drivers/char/mem.c 2005-02-16 23:32:15.244876816 +0900
@@ -25,6 +25,9 @@
#include <linux/device.h>
#include <linux/highmem.h>
#include <linux/crash_dump.h>
+#include <linux/bootmem.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>

#include <asm/uaccess.h>
#include <asm/io.h>
@@ -759,3 +762,125 @@
}

fs_initcall(chr_dev_init);
+
+#ifdef CONFIG_PROC_FS
+/*
+ * /proc/cpumem: show valid physical address range
+ */
+struct cpumem_info {
+	unsigned long long addr;
+	unsigned long long size;
+};
+
+static void *cpumem_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct cpumem_info *p = m->private;
+	unsigned long long end = (unsigned long long)max_pfn << PAGE_SHIFT;
+	unsigned long long addr;
+	size_t size;
+	int found = 0;
+
+	(*pos)++;
+
+	if (p->addr >= end) {
+		return NULL;
+	}
+
+	/* always start page boundary */
+	addr = ((p->addr + p->size + PAGE_SIZE - 1) >> PAGE_SHIFT) << PAGE_SHIFT;
+	size = 0xf0000000;
+
+	while (addr < end) {
+		if (valid_phys_addr_range(addr, &size)) {
+			if (!found) {
+				found = 1;
+				p->addr = addr;
+				p->size = size;
+			} else {
+				p->size += size;
+			}
+			addr += size;
+			size = 0xf0000000;
+		} else {
+			if (found) {
+				return p;
+			}
+			addr += PAGE_SIZE;
+		}
+	}
+
+	return found ? p : NULL;
+}
+
+static void *cpumem_start(struct seq_file *m, loff_t *pos)
+{
+	struct cpumem_info *p = m->private;
+	loff_t n = 0;
+
+	p->addr = 0;
+	p->size = 0;
+
+	while (n <= *pos) {
+		if (!cpumem_next(m, NULL, &n)) {
+			return NULL;
+		}
+	}
+
+	return p;
+}
+
+static void cpumem_stop(struct seq_file *m, void *v)
+{
+}
+
+static int cpumem_show(struct seq_file *m, void *v)
+{
+	struct cpumem_info *p = m->private;
+	unsigned long long end = (unsigned long long)max_pfn << PAGE_SHIFT;
+
+	if (p->addr < end) {
+		seq_printf(m, "%016llx %016llx\n", p->addr, p->size);
+	}
+	return 0;
+}
+
+struct seq_operations cpumem_op = {
+	.start = cpumem_start,
+	.next = cpumem_next,
+	.stop = cpumem_stop,
+	.show = cpumem_show
+};
+
+static int cpumem_open(struct inode *inode, struct file *file)
+{
+	int res = seq_open(file, &cpumem_op);
+	if (!res) {
+		struct seq_file *m = file->private_data;
+		m->private = kmalloc(sizeof(struct cpumem_info), GFP_KERNEL);
+		if (!m->private) {
+			seq_release(inode, file);
+			return -ENOMEM;
+		}
+	}
+	return res;
+}
+
+static struct file_operations proc_cpumem_operations = {
+	.open = cpumem_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release_private
+};
+
+static int __init cpumem_init(void)
+{
+	struct proc_dir_entry *entry;
+
+	entry = create_proc_entry("cpumem", 0, NULL);
+	if (entry) {
+		entry->proc_fops = &proc_cpumem_operations;
+	}
+	return 0;
+}
+__initcall(cpumem_init);
+#endif /* CONFIG_PROC_FS */

---
--- linux-2.6.11-rc3-mm2/arch/i386/mm/init.c 2005-02-16 15:36:29.000000000 +0900
+++ linux-2.6.11-rc3-mm2-test/arch/i386/mm/init.c 2005-02-16 23:32:29.499709752 +0900
@@ -248,6 +248,47 @@
return 0;
}

+int valid_phys_addr_range(unsigned long long phys_addr, size_t *size)
+{
+	int i;
+	unsigned long long addr, end;
+	efi_memory_desc_t *md;
+
+	if (efi_enabled) {
+		for (i = 0; i < memmap.nr_map; i++) {
+			md = &memmap.map[i];
+			if (!is_available_memory(md)) {
+				continue;
+			}
+			addr = md->phys_addr;
+			end = md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT);
+			if ((phys_addr >= addr) && (phys_addr < end)) {
+				if (*size > end - phys_addr) {
+					*size = end - phys_addr;
+				}
+				return 1;
+			}
+		}
+		return 0;
+	}
+
+	for (i = 0; i < e820.nr_map; i++) {
+		if (e820.map[i].type != E820_RAM) {
+			continue;
+		}
+		addr = e820.map[i].addr;
+		end = e820.map[i].addr + e820.map[i].size;
+		if ((phys_addr >= addr) && (phys_addr < end)) {
+			if (*size > end - phys_addr) {
+				*size = end - phys_addr;
+			}
+			return 1;
+		}
+	}
+	return 0;
+}
+EXPORT_SYMBOL(valid_phys_addr_range);
+
#ifdef CONFIG_HIGHMEM
pte_t *kmap_pte;
pgprot_t kmap_prot;

---
--- linux-2.6.11-rc3-mm2/include/asm-i386/io.h 2004-12-25 06:35:40.000000000 +0900
+++ linux-2.6.11-rc3-mm2-test/include/asm-i386/io.h 2005-02-16 23:36:24.454991120 +0900
@@ -90,6 +90,12 @@
*/
#define page_to_phys(page) ((dma_addr_t)page_to_pfn(page) << PAGE_SHIFT)

+/*
+ * for /dev/mem
+ */
+#define ARCH_HAS_VALID_PHYS_ADDR_RANGE
+extern int valid_phys_addr_range(unsigned long long, size_t *);
+
extern void __iomem * __ioremap(unsigned long offset, unsigned long size, unsigned long flags);

/**

---
--- linux-2.6.11-rc3-mm2/arch/x86_64/mm/init.c 2005-02-16 15:36:30.000000000 +0900
+++ linux-2.6.11-rc3-mm2-test/arch/x86_64/mm/init.c 2005-02-16 16:23:08.000000000 +0900
@@ -22,6 +22,7 @@
#include <linux/pagemap.h>
#include <linux/bootmem.h>
#include <linux/proc_fs.h>
+#include <linux/module.h>

#include <asm/processor.h>
#include <asm/system.h>
@@ -395,6 +396,27 @@
return 0;
}

+int valid_phys_addr_range(unsigned long phys_addr, size_t *size)
+{
+	int i;
+	unsigned long end;
+
+	for (i = 0; i < e820.nr_map; i++) {
+		if (e820.map[i].type != E820_RAM) {
+			continue;
+		}
+		end = e820.map[i].addr + e820.map[i].size;
+		if (phys_addr >= e820.map[i].addr && phys_addr < end) {
+			if (*size > end - phys_addr) {
+				*size = end - phys_addr;
+			}
+			return 1;
+		}
+	}
+	return 0;
+}
+EXPORT_SYMBOL(valid_phys_addr_range);
+
extern int swiotlb_force;

/*

---
--- linux-2.6.11-rc3-mm2/include/asm-x86_64/io.h 2005-02-16 15:36:12.000000000 +0900
+++ linux-2.6.11-rc3-mm2-test/include/asm-x86_64/io.h 2005-02-16 16:23:59.000000000 +0900
@@ -123,6 +123,9 @@
{
return __va(address);
}
+
+#define ARCH_HAS_VALID_PHYS_ADDR_RANGE
+extern int valid_phys_addr_range(unsigned long, size_t *);
#endif

/*

---

Eric W. Biederman

Feb 16, 2005, 9:10:11 AM

Itsuro Oda <o...@valinux.co.jp> writes:

> Hi, Eric and all
>
> Attached is an implementation of /proc/cpumem.
> /proc/cpumem shows the valid physical memory ranges.

Interesting. What I imagined when I proposed this
was something based on struct resource that works
like /proc/iomem on x86 but can be meaningfully
used on systems where ram lives in a separate
address space from io device memory.

> example: amd64 8GB Mem
> # cat /proc/cpumem
> 0000000000000000 000000000009b800
> 0000000000100000 00000000fbe70000
> 0000000100000000 0000000100000000
> #
> start address and size. hex digit.

The lack of a type field loses a fair amount of functionality compared
to /proc/iomem. In particular you can't see where the ACPI data is.

The other direction something like this can go is to dump
the data structures in linux/mmzone.h

> Any comments, recomendations and suggestions are welcom.
>
> BTW, does not kexec/kdump run on 2.6.11-rc3-mm2 ?
> How do I get and examine the latest kexec/kdump ?

I'm not quite certain what is happening.

I have been playing with kexec user space a little bit and a new
development release is at:
http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.101.tar.gz

I have written a first pass at a user space core dump generator,
using /dev/mem. /sbin/kexec still needs some work to prepare
the ELF headers before a crash.

Eric

YAMAMOTO Takashi

Feb 16, 2005, 7:30:09 PM

hi,

> +	while (addr < end) {
> +		if (valid_phys_addr_range(addr, &size)) {
> +			if (!found) {
> +				found = 1;
> +				p->addr = addr;
> +				p->size = size;
> +			} else {
> +				p->size += size;
> +			}
> +			addr += size;
> +			size = 0xf0000000;
> +		} else {
> +			if (found) {
> +				return p;
> +			}
> +			addr += PAGE_SIZE;
> +		}
> +	}

doesn't this loop take a very long time if you have a large hole?

i'd suggest changing valid_phys_addr_range to fill in *size even when
it returns false, so that the caller can skip the hole efficiently.
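
A minimal user-space sketch of that suggestion follows. It is not the patch author's code: the ram_range table stands in for the E820_RAM entries, the extra limit parameter and the 64-bit types are assumptions made so the example is self-contained, and both function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one E820_RAM entry of the firmware memory map. */
struct ram_range {
	uint64_t addr;
	uint64_t size;
};

/*
 * On a hit: return 1 and clamp *size to the remainder of the RAM range.
 * On a miss: return 0 and set *size to the distance from phys_addr to the
 * next RAM range (or to limit if there is none), so the caller can skip
 * the whole hole in one step. Requires phys_addr < limit.
 */
static int valid_range_or_hole(const struct ram_range *map, size_t n,
			       uint64_t phys_addr, uint64_t limit,
			       uint64_t *size)
{
	uint64_t next = limit;

	for (size_t i = 0; i < n; i++) {
		uint64_t start = map[i].addr;
		uint64_t end = start + map[i].size;

		if (phys_addr >= start && phys_addr < end) {
			if (*size > end - phys_addr)
				*size = end - phys_addr;
			return 1;
		}
		/* Remember the closest range that begins after phys_addr. */
		if (start > phys_addr && start < next)
			next = start;
	}
	*size = next - phys_addr;
	return 0;
}

/* Walk [0, limit) and count iterations; each hole costs exactly one step. */
static unsigned long walk(const struct ram_range *map, size_t n,
			  uint64_t limit)
{
	unsigned long steps = 0;
	uint64_t addr = 0;

	while (addr < limit) {
		uint64_t size = limit - addr;

		valid_range_or_hole(map, n, addr, limit, &size);
		addr += size;
		steps++;
	}
	return steps;
}
```

With the three ranges from the amd64 example in this thread, the walk takes one step per range and one per hole, instead of one step per page inside each hole.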

YAMAMOTO Takashi

Itsuro Oda

Feb 16, 2005, 7:50:09 PM

Hi Eric,

> The lack of a type field loses a fair amount of functionality compared
> to /proc/iomem. In particular you can't see where the ACPI data is.

Hmm, restricting this to System RAM only may be too pessimistic.
(One of the motivations for this work is to use /dev/mem safely.
"dd if=/dev/mem of=xxx" causes a panic on my amd64 (8GB mem) machine
since reading from an address around 0xfe000000 causes a machine
check. Hmm, this area is marked as "reserved", not as an ACPI area.
The ACPI area can be read.)

Ok, I will add a type field.

> The other direction something like this can go is to dump
> the data structures in linux/mmzone.h

Do you mean defining a data structure in linux/mmzone.h ?

I used to think a particular struct is not necessary for this work,
but now I think it is better to define a struct for this.
Let me consider.

> I have written a first pass at a user space core dump generator,
> using /dev/mem. /sbin/kexec still needs some work to prepare
> the ELF headers before a crash.

I am looking forward to this :-)

And, you mentioned a couple of weeks ago:
> Anyway one thing I want to do is actually drop the apic shutdown
> code altogether in this code path. I threw it in there to
> ease the transition from the old code base to the new, but
> if that code is causing issues.... So this is probably a good time
> to start testing that.

How about this ?

Thanks.
--
Itsuro ODA <o...@valinux.co.jp>


Vivek Goyal

Feb 17, 2005, 12:10:08 AM

Hi,

On Wed, 2005-02-16 at 14:19, Itsuro Oda wrote:

>
> BTW, does not kexec/kdump run on 2.6.11-rc3-mm2 ?
> How do I get and examine the latest kexec/kdump ?

Currently kdump is broken. I am working on the ELF header generation part in
kexec-tools. Next week I should be able to post the initial patches.

I thought the EFI-related data structures are of type __initdata and will be
gone after initialization. (efi.c)

Thanks
Vivek

Itsuro Oda

Feb 17, 2005, 1:30:15 AM

Hi,

> I thought the EFI-related data structures are of type __initdata and will be
> gone after initialization. (efi.c)

oops, certainly.
and devmem_is_allowed makes the same mistake :-)
(I don't know who wrote it.)

Thanks.
--
Itsuro ODA <o...@valinux.co.jp>


Eric W. Biederman

Feb 17, 2005, 5:00:17 AM

Itsuro Oda <o...@valinux.co.jp> writes:

> Hi Eric,
>
> > The lack of a type field loses a fair amount of functionality compared
> > to /proc/iomem. In particular you can't see where the ACPI data is.
>
> Hmm, restricting this to System RAM only may be too pessimistic.
> (One of the motivations for this work is to use /dev/mem safely.
> "dd if=/dev/mem of=xxx" causes a panic on my amd64 (8GB mem) machine
> since reading from an address around 0xfe000000 causes a machine
> check. Hmm, this area is marked as "reserved", not as an ACPI area.
> The ACPI area can be read.)
>
> Ok, I will add a type field.

To be very clear: I do not believe this is necessary for x86. There
is already sufficient information elsewhere to handle this.

> > The other direction something like this can go is to dump
> > the data structures in linux/mmzone.h
>
> Do you mean defining a data structure in linux/mmzone.h ?
>
> I used to think a particular struct is not necessary for this work,
> but now I think it is better to define a struct for this.
> Let me consider.

To be clear there are two pieces of information that are needed.
1) The list of physical memory areas and what they are.
/proc/iomem does a good job of this.
2) The list of which memory areas the kernel is using.
It is the pg_data_t and related structures that define this.
For the purposes of a core dump we want to capture this
information before the kernel crashes and use it afterward.
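
As an illustration of item 1, the "start-end : name" lines that /proc/iomem prints can be filtered for RAM from user space along these lines. This is a sketch only: the struct and helper names are hypothetical, and the parser works on strings so it does not depend on what a particular machine exports.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for one /proc/iomem line: "start-end : name". */
struct iomem_entry {
	uint64_t start;
	uint64_t end;		/* inclusive, as printed by the kernel */
	char name[64];
};

/*
 * Parse one line. Leading whitespace (used for nested entries) is skipped.
 * Returns 1 on success, 0 if the line does not match the expected layout.
 */
static int parse_iomem_line(const char *line, struct iomem_entry *e)
{
	return sscanf(line, " %" SCNx64 "-%" SCNx64 " : %63[^\n]",
		      &e->start, &e->end, e->name) == 3;
}

/* True if the entry is ordinary RAM rather than reserved or device memory. */
static int is_system_ram(const struct iomem_entry *e)
{
	return strcmp(e->name, "System RAM") == 0;
}
```

The name column is what carries the type information; item 2 has no comparable user-space view, which is why it has to be captured before the crash.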

> > I have written a first pass at a user space core dump generator,
> > using /dev/mem. /sbin/kexec still needs some work to prepare
> > the ELF headers before a crash.
>
> I am looking forward to this :-)
>
> And, you mentioned a couple of weeks ago:
> > Anyway one thing I want to do is actually drop the apic shutdown
> > code altogether in this code path. I threw it in there to
> > ease the transition from the old code base to the new, but
> > if that code is causing issues.... So this is probably a good time
> > to start testing that.
>
> How about this ?

My role in this is that of maintainer and architect. On a practical
level I gain nothing from a working crash-dump/kexec-on-panic
implementation except that it stops being a gating factor for the rest
of the kexec code. So while I can often see what needs to be
done, it is hard for me to justify doing it. Where I will usually
weigh in with code is when I see a particular blind spot
on the part of the implementors.

The parties I see actively working on the crash dump implementation
are currently a group from IBM and you guys from valinux.co.jp.
One of the primaries at IBM has been on vacation which is likely
why we have not seen anything out of them for the last couple of
weeks.

But also, this is open source software: it will be done when it
is done.

Eric

Dave Jones

Feb 17, 2005, 1:30:14 PM

On Wed, Feb 16, 2005 at 05:49:51PM +0900, Itsuro Oda wrote:
> Hi, Eric and all
>
> Attached is an implementation of /proc/cpumem.
> /proc/cpumem shows the valid physical memory ranges.
>
> * i386 and x86_64
> * implement valid_phys_addr_range() and use it.
> (the first argument of the i386 version is a little uncomfortable.)
> * /dev/mem of the i386 version should be modified. but not yet.
>
> example: amd64 8GB Mem
> # cat /proc/cpumem
> 0000000000000000 000000000009b800
> 0000000000100000 00000000fbe70000
> 0000000100000000 0000000100000000
> #
> start address and size. hex digit.
>
> Any comments, recommendations and suggestions are welcome.

It may make more sense to export the entire e820 (or similar)
bios memory tables. Probably better off in sysfs than adding
more cruft to procfs too.

Dave

Eric W. Biederman

Feb 17, 2005, 3:10:24 PM

Dave Jones <da...@redhat.com> writes:

> On Wed, Feb 16, 2005 at 05:49:51PM +0900, Itsuro Oda wrote:
> > Hi, Eric and all
> >
> > Attached is an implementation of /proc/cpumem.
> > /proc/cpumem shows the valid physical memory ranges.
> >
> > * i386 and x86_64
> > * implement valid_phys_addr_range() and use it.
> > (the first argument of the i386 version is a little uncomfortable.)
> > * /dev/mem of the i386 version should be modified. but not yet.
> >
> > example: amd64 8GB Mem
> > # cat /proc/cpumem
> > 0000000000000000 000000000009b800
> > 0000000000100000 00000000fbe70000
> > 0000000100000000 0000000100000000
> > #
> > start address and size. hex digit.
> >
> > Any comments, recommendations and suggestions are welcome.
>
> It may make more sense to export the entire e820 (or similar)
> bios memory tables. Probably better off in sysfs than adding
> more cruft to procfs too.

Agreed. In practice we actually do this already with /proc/iomem.
Except that we truncate everything above 4GB, and we allow the
map to get mangled with mem=xxx options.

I brought up the idea of a /proc/cpumem by analogy because on platforms
that have an IOMMU, where memory is in a distinct address space,
there have been complaints that /proc/iomem just won't work. But it
is simple enough to do something that is just for the CPU's memory.

As for how to do this cleanly, this looks like the start of that discussion.

Eric

Itsuro Oda

Feb 18, 2005, 1:20:07 AM

Hi,

On 17 Feb 2005 02:55:31 -0700

ebie...@xmission.com (Eric W. Biederman) wrote:

> My role in this is that of maintainer and architect. On a practical
> level I gain nothing from a working crash-dump/kexec-on-panic
> implementation except that it stops being a gating factor for the rest
> of the kexec code. So while I can often see what needs to be
> done, it is hard for me to justify doing it. Where I will usually
> weigh in with code is when I see a particular blind spot
> on the part of the implementors.

I see. I would like to contribute as much as I can.

Thanks.
--
Itsuro ODA <o...@valinux.co.jp>


Eric W. Biederman

Feb 18, 2005, 2:30:10 AM

Itsuro Oda <o...@valinux.co.jp> writes:

> I see. I would like to contribute as much as I can.

Pick some piece that you have an affinity for and work on it.
Problems are best solved by those who see them and by those who care :)

I believe Vivek Goyal is currently working on the remaining user space
piece, and expects to have something in a week or so.

Eric