[PATCH] Virtual machine-check architecture support

Philip Soltero

Jan 30, 2011, 12:59:13 AM

to V3VEE Development

Here is the virtual machine-check architecture support patch that I
worked on last semester.

What it includes:
* A simple macro I use for code style.
* Virtual machine-check architecture code with API documentation.
* Sample configuration xml.

What it can do:
* Inject Northbridge MCEs on CPU 0 in an x86 guest.

What it should be able to do but won't:
* Inject Northbridge MCEs on CPU 0 in an x86-64 guest (not sure what
the problem is).

What it lacks:
* Injection of arbitrary MCEs (only Northbridge MCEs are supported).
* Multicore support.
* Per-CPU banks and one shared Northbridge bank (bank 4).
* Virtualization of MCi_MISC registers.
* Various functionality specified in the Intel and AMD documentation.
* Build, configuration, and use documentation.
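
For context on the bank layout listed above, here is a minimal C sketch of how a virtual MCA could decode bank MSRs. The MSR addresses (MCG_CAP at 0x179, and MCi_CTL/STATUS/ADDR/MISC starting at 0x400 with four MSRs per bank) and the MCG_CAP fields (bank count in bits 7:0, MCG_CTL_P in bit 8) are architectural; the bank count of five, the helper names, and the policy of sharing only bank 4 across cores are assumptions for illustration, not the patch's code.

```c
#include <stdint.h>

/* Architectural MCA MSR addresses. */
#define MSR_MCG_CAP     0x179
#define MSR_MCG_STATUS  0x17A
#define MSR_MCG_CTL     0x17B
#define MSR_MC0_CTL     0x400   /* bank i occupies 0x400 + 4*i .. 0x403 + 4*i */

/* Assumptions for this sketch: five banks, bank 4 = AMD Northbridge. */
#define VMCA_NUM_BANKS  5
#define VMCA_NB_BANK    4

enum mca_bank_reg { MCi_CTL = 0, MCi_STATUS = 1, MCi_ADDR = 2, MCi_MISC = 3 };

/* Decode an MSR index into (bank, register). Returns 0 on success, or -1
 * if the MSR is not a bank register of the virtual CPU. */
int decode_bank_msr(uint32_t msr, int *bank, enum mca_bank_reg *reg) {
    if (msr < MSR_MC0_CTL || msr >= MSR_MC0_CTL + 4u * VMCA_NUM_BANKS) {
        return -1;
    }
    *bank = (int)((msr - MSR_MC0_CTL) / 4);
    *reg = (enum mca_bank_reg)((msr - MSR_MC0_CTL) % 4);
    return 0;
}

/* MCG_CAP: bits 7:0 = bank count, bit 8 = MCG_CTL_P (MCG_CTL present). */
uint64_t make_mcg_cap(int ctl_present) {
    return (uint64_t)VMCA_NUM_BANKS | ((uint64_t)(ctl_present ? 1 : 0) << 8);
}

/* Hypothetical policy: the Northbridge bank is shared by all cores,
 * every other bank is per-CPU. */
int bank_is_shared(int bank) {
    return bank == VMCA_NB_BANK;
}
```

With this layout, a guest RDMSR of 0x411 decodes to bank 4, MCi_STATUS, which a multicore version would route to the single shared Northbridge bank rather than to per-core state.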

palacios_vmm_common_macros

palacios_vmm_mcheck

Philip Soltero

Jan 31, 2011, 2:13:33 PM

to V3VEE Development

The palacios_vmm_mcheck patch won't apply cleanly without this MSR
debug patch:

Add an MSR debug configuration option.

From: Philip Soltero <psol...@cs.unm.edu>

---
Kconfig | 8 ++++++++
palacios/src/palacios/vmm_msr.c | 5 +++++
2 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/Kconfig b/Kconfig
index dcee752..7348a5f 100644
--- a/Kconfig
+++ b/Kconfig
@@ -387,6 +387,14 @@ config DEBUG_DEV_MGR
 	help
 	  This turns on debugging for the device manager
 
+config DEBUG_MSR
+	bool "MSR"
+	default n
+	depends on DEBUG_ON
+	help
+	  This turns on debugging for MSR handling.
+
+

diff --git a/palacios/src/palacios/vmm_msr.c b/palacios/src/palacios/vmm_msr.c
index 5fe8ecf..365ea84 100644
--- a/palacios/src/palacios/vmm_msr.c
+++ b/palacios/src/palacios/vmm_msr.c
@@ -22,6 +22,11 @@
 #include <palacios/vmm.h>
 #include <palacios/vm_guest.h>
 
+#ifndef CONFIG_DEBUG_MSR
+#undef PrintDebug
+#define PrintDebug(fmt, args...)
+#endif
+
 static int free_hook(struct v3_vm_info * vm, struct v3_msr_hook * hook);
 
 void v3_init_msr_map(struct v3_vm_info * vm) {
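
A small self-contained sketch of the compile-out pattern this patch uses: when CONFIG_DEBUG_MSR is not defined, PrintDebug is redefined to expand to nothing, so debug calls in vmm_msr.c disappear at compile time and their arguments are never evaluated. The stand-in PrintDebug and the handle_msr_read function are hypothetical, and the sketch uses the C99 `...`/__VA_ARGS__ form rather than the GNU `fmt, args...` form in the patch (GCC accepts both).

```c
#include <stdio.h>

/* Stand-in for the real PrintDebug: prints and counts calls so the
 * effect of the override is observable. */
int debug_calls = 0;
#define PrintDebug(...) (debug_calls++, printf(__VA_ARGS__))

/* The pattern from the patch: if the per-file debug option is off,
 * replace PrintDebug with a macro that expands to nothing. */
#ifndef CONFIG_DEBUG_MSR
#undef PrintDebug
#define PrintDebug(...)
#endif

void handle_msr_read(unsigned int msr) {
    PrintDebug("MSR read: 0x%x\n", msr);   /* becomes an empty statement */
    (void)msr;                             /* silence unused-parameter warnings */
}
```

Building with -DCONFIG_DEBUG_MSR keeps the original PrintDebug, which is what the new Kconfig option is meant to control.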

Jack Lange

Jan 31, 2011, 11:46:24 PM

to v3vee-de...@googlegroups.com

Hey Phil,

What exactly is the use case for this?

Is it for propagating physical MCEs to the guest, or is it purely for
injecting virtualized MCEs? Are these MCEs tied to a specific event
that needs to be signalled or are they just artificial events?

Also, how are MCEs generated? Internal Palacios events, external
events, user signals?

--Jack

Philip Soltero

Feb 1, 2011, 11:30:14 AM

to V3VEE Development

It was built to inject virtualized MCEs into the guest.

One example is the memory corrupter device that I worked on last
semester. A page of memory is hooked and part of it is marked as
corrupted. When the guest reads from the corrupted memory, an ECC DRAM
MCE is injected, and the guest application is given the chance to
deal with the problem.
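
The core check such a device needs can be sketched as below, assuming (hypothetically) that the device tracks a single corrupted byte range per hooked page; the struct and function names are illustrative and not taken from the memory corrupter code. A read that overlaps the range would trigger the ECC DRAM MCE injection instead of completing normally.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one hooked guest page. */
struct corrupt_page {
    uint64_t page_gpa;      /* guest-physical base of the hooked page */
    uint32_t bad_offset;    /* first corrupted byte within the page */
    uint32_t bad_len;       /* number of corrupted bytes */
};

/* Returns true if a guest read of [gpa, gpa + len) overlaps the corrupted
 * range, i.e. the read handler should inject an MCE rather than return data. */
bool read_hits_corruption(const struct corrupt_page *p, uint64_t gpa, uint32_t len) {
    if (len == 0 || p->bad_len == 0) {
        return false;
    }
    uint64_t bad_start = p->page_gpa + p->bad_offset;
    uint64_t bad_end = bad_start + p->bad_len;   /* exclusive */
    uint64_t rd_end = gpa + len;                 /* exclusive */
    return gpa < bad_end && bad_start < rd_end;  /* half-open interval overlap */
}
```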

To test the virtualized MCA I've been hooking a "special" CPUID. When
the guest executes it, Palacios injects an MCE. This didn't work: the
MCE was injected but the guest didn't receive it. Thinking that there
is something "wrong" with injecting an MCE while handling a CPUID, I
implemented a 100-exit countdown that injects the MCE 100 VM exits
after the CPUID is encountered. This worked for x86 guests; however,
x86-64 guests behave as though they never received the exception.
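
A minimal sketch of the deferred-injection countdown described above, assuming a per-core counter that the CPUID hook arms and the VM exit handler ticks. The names and the exact counting convention (the CPUID exit itself is not counted) are assumptions, not the code from the patch.

```c
#include <stdbool.h>

#define MCE_DELAY_EXITS 100

/* Hypothetical per-core state for deferred MCE injection. */
struct mce_countdown {
    int remaining;   /* exits left before injection; 0 means disarmed */
};

/* Called from the hooked "special" CPUID handler: arm the countdown
 * instead of injecting immediately. */
void mce_countdown_arm(struct mce_countdown *c) {
    c->remaining = MCE_DELAY_EXITS;
}

/* Called once per subsequent VM exit. Returns true exactly once, on the
 * exit at which the MCE should be injected, and then stays disarmed. */
bool mce_countdown_tick(struct mce_countdown *c) {
    if (c->remaining == 0) {
        return false;
    }
    c->remaining--;
    return c->remaining == 0;
}
```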

I'd like to know if you have any suggestions for a better signaling
strategy.

Philip

Jack Lange

Feb 2, 2011, 4:08:48 PM

to v3vee-de...@googlegroups.com

Hey Phil,

I pushed the machine check stuff with some changes. Unfortunately, I
don't have an easy answer to either of the two hard problems you have.
As for a channel to trigger machine checks, I'm not really sure. The
two options I see are either using the host event framework, or
providing an injection interface exported by the machine check
framework. Since this is an experimental framework at the moment (in
its use, not its form), I would also be OK with just exporting the
injection API directly to the host OS and setting up some sort of
trigger signal there.
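
One possible shape for the exported-injection option, sketched under assumptions: the host posts a request (bank number and MCi_STATUS value) into a shared slot, and the VMM checks the slot before each VM entry. The struct, both functions, and the single-producer/single-consumer restriction are hypothetical; this is not an existing Palacios or host event API.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical request slot shared by the host trigger and one vcore.
 * Assumes a single host-side caller and a single VMM-side consumer. */
struct mce_request {
    atomic_bool pending;
    uint8_t bank;      /* MCA bank to report in */
    uint64_t status;   /* value for MCi_STATUS */
};

/* Host side: exported trigger. Fails if a request is still queued. */
int host_trigger_mce(struct mce_request *r, uint8_t bank, uint64_t status) {
    if (atomic_load(&r->pending)) {
        return -1;
    }
    r->bank = bank;
    r->status = status;
    atomic_store(&r->pending, true);   /* publish only after filling fields */
    return 0;
}

/* VMM side: called before each VM entry; returns true if an MCE
 * should be injected on this entry. */
bool vmm_take_mce(struct mce_request *r, uint8_t *bank, uint64_t *status) {
    if (!atomic_load(&r->pending)) {
        return false;
    }
    *bank = r->bank;
    *status = r->status;
    atomic_store(&r->pending, false);  /* release the slot after copying */
    return true;
}
```

Deferring the actual injection to the next VM entry also keeps it out of exit handlers such as CPUID, which matches the workaround described earlier in the thread.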

As for the generic CPUIDs, yikes. What we probably need is a
centralized set of virtualized CPUID values for the common codes.
Then, if a component wants to modify certain bits of the CPUID values,
it will call something like v3_get_cpuid_regs(CPUID, &eax, &ebx, &ecx,
&edx), which returns the CPUID registers for a given CPUID code
(or value?). These registers could then be modified as needed and
loaded back using v3_set_cpuid_regs(CPUID, eax, ebx, ecx, edx). This
would also let us properly virtualize the CPUID instruction, instead
of just passing it through as we do now.
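
A rough sketch of the centralized CPUID table proposed above. Only the two function names come from the proposal; the argument types, the single global table (a real version would presumably be per-VM), and the linear lookup are assumptions. The example caller sets the architectural MCE (bit 7) and MCA (bit 14) flags in CPUID leaf 1 EDX, which is the kind of change a machine-check device would need.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table of virtualized CPUID values, one entry per leaf. */
struct cpuid_entry {
    uint32_t leaf;
    uint32_t eax, ebx, ecx, edx;
};

#define MAX_CPUID_LEAVES 32

static struct cpuid_entry cpuid_table[MAX_CPUID_LEAVES];
static size_t cpuid_count = 0;

static struct cpuid_entry *find_leaf(uint32_t leaf) {
    for (size_t i = 0; i < cpuid_count; i++) {
        if (cpuid_table[i].leaf == leaf) {
            return &cpuid_table[i];
        }
    }
    return NULL;
}

/* Copy out the virtualized registers for a leaf; -1 if unknown. */
int v3_get_cpuid_regs(uint32_t leaf, uint32_t *eax, uint32_t *ebx,
                      uint32_t *ecx, uint32_t *edx) {
    struct cpuid_entry *e = find_leaf(leaf);
    if (e == NULL) {
        return -1;
    }
    *eax = e->eax; *ebx = e->ebx; *ecx = e->ecx; *edx = e->edx;
    return 0;
}

/* Store (or create) the virtualized registers for a leaf; -1 if full. */
int v3_set_cpuid_regs(uint32_t leaf, uint32_t eax, uint32_t ebx,
                      uint32_t ecx, uint32_t edx) {
    struct cpuid_entry *e = find_leaf(leaf);
    if (e == NULL) {
        if (cpuid_count == MAX_CPUID_LEAVES) {
            return -1;
        }
        e = &cpuid_table[cpuid_count++];
        e->leaf = leaf;
    }
    e->eax = eax; e->ebx = ebx; e->ecx = ecx; e->edx = edx;
    return 0;
}

/* Example component: a machine-check device advertising MCE and MCA. */
#define CPUID_1_EDX_MCE (1u << 7)
#define CPUID_1_EDX_MCA (1u << 14)

int mcheck_enable_cpuid_bits(void) {
    uint32_t eax, ebx, ecx, edx;
    if (v3_get_cpuid_regs(1, &eax, &ebx, &ecx, &edx) != 0) {
        return -1;
    }
    edx |= CPUID_1_EDX_MCE | CPUID_1_EDX_MCA;   /* read-modify-write */
    return v3_set_cpuid_regs(1, eax, ebx, ecx, edx);
}
```

The CPUID exit handler would then answer guest CPUID instructions from this table instead of passing them through.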

Changes:
1) I changed it to be a device instead of part of the core VMM, so
that it is easier to enable/disable without a bunch of #ifdef
macros.

2) I removed the configuration parameters, as well as the partial
passthrough behavior of the CPUID registers. It's simpler to just turn
everything on or off, unless there is an existing need for
finer-grained functionality.

3) I removed all the gotos. I understand that there is a certain need
for them, especially in initialization routines that handle failures
in the middle of a large number of allocations. But even so, I barely
tolerate them as it is, and whenever I see a multistage goto tree I
just start deleting things.

The real problem with gotos is that they encourage laziness. They let
people collect a number of complex operations together without really
thinking about them, or let a code path grow to a point at which the
complexity is unmanageable. As such, they treat a symptom instead of
providing an actual cure. For example, I've seen a number of things
like this:

lock();
if (!can_fail()) goto error;
ATOMIC_SECTION();
error:
unlock();

vs.

if (!can_fail()) return -1;
lock();
ATOMIC_SECTION();
unlock();

As a general rule, using gotos usually means that there is something
fundamentally broken with what you are doing. But enough of the
soapbox.
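
A runnable version of the two snippets above, with a mock lock that records its state. It shows that both forms release the lock on every path, and that the early-return form never takes the lock when the check fails. The two are only interchangeable if can_fail() does not itself need to run under the lock; all names here are illustrative.

```c
#include <stdbool.h>

/* Mock lock that records its state so both patterns can be checked. */
static int lock_depth = 0;
static int lock_acquisitions = 0;
static void lock(void)   { lock_depth++; lock_acquisitions++; }
static void unlock(void) { lock_depth--; }

static int atomic_work = 0;
static void atomic_section(void) { atomic_work++; }

/* Pattern 1: goto-based. The label is reached on both the success
 * and the failure path, so unlock() always runs. */
static int with_goto(bool (*can_fail)(void)) {
    int ret = 0;
    lock();
    if (!can_fail()) {
        ret = -1;
        goto out;
    }
    atomic_section();
out:
    unlock();
    return ret;
}

/* Pattern 2: check first, then hold the lock only for the work that
 * actually needs it. */
static int early_return(bool (*can_fail)(void)) {
    if (!can_fail()) {
        return -1;
    }
    lock();
    atomic_section();
    unlock();
    return 0;
}
```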

Also, I screwed up the author entry on the commit, but pushed an
attribution commit afterwards to try to fix it. Sorry.

--Jack

Philip Soltero

Feb 3, 2011, 2:18:16 AM

to v3vee-de...@googlegroups.com

Comments interspersed:

> Hey Phil,
>
> I pushed the machine check stuff with some changes. But unfortunately
> I don't really have an easy answer to the 2 hard problems you have. As
> far as a channel to trigger machine checks, I'm not really sure. The
> two options I see are using either the host event framework and
> providing an exported injection interface from the machine check
> framework. Since at the moment this is an experimental framework (in
> its use, not form) I would also be ok with just exporting the
> injection API directly to the host OS and setting up some sort of
> trigger signal there.
>

Thanks for the suggestion. I'll consider a host-based trigger mechanism
when I return to working on this.

> As far as the generic CPUIDs go, yikes. What we probably need to have
> is a centralized set of virtualized CPUID values for the common codes.
> Then if a component wants to modify certain bits of the CPUID values
> it will call something like v3_get_cpuid_regs(CPUID, &eax, &ebx, &ecx,
> &edx), which will return the CPUID registers for a given CPUID code
> (or value?). These registers could then be modified as needed, and
> loaded back using v3_set_cpuid_regs(CPUID, eax, ebx, ecx, edx). This
> would also let us properly virtualize the CPUID instruction, instead
> of just passing it through as we do now.
>

I wouldn't mind taking a crack at this in a few weeks, unless it's on
the agenda for the release hackfest, which I will not be attending.

> Changes:
> 1) I changed it to be a device, instead of part of the core VMM so
> that it would be easier to enable/disable it without a bunch of #ifdef
> macros
>

Makes sense. It actually started out that way.

> 2) I removed the configuration parameters, as well as the sort of
> passthrough behavior to the CPUID registers. It's simpler to just turn
> everything on or off, unless there is an existing need to provide
> finer grain functionality.
>

Agreed about the variable-passthrough CPUIDs. The configuration
parameters, with the exception of the MCG_CTL available bit, were not
really used.

> 3) I removed all the goto's.... I can understand that there is a
> certain need for them, especially in the initialization routines to
> handle failures in the middle of a large number of allocations. But
> even so I barely tolerate them as it is, and whenever I see a
> multistage goto tree I just start deleting things.
>
> The real problem with goto's is that they encourage laziness. It
> allows people to collect a number of complex operations together
> without really thinking about them, or to allow a code path to get to
> a point at which the complexity is unmanageable. As such it is a
> treatment for a symptom instead of an actual cure. For example I've
> seen a number of things like this:
>
> lock();
> if (!can_fail()) goto error;
> ATOMIC_SECTION();
> error:
> unlock();
>
> vs.
>
> if (!can_fail()) return -1;
> lock();
> ATOMIC_SECTION();
> unlock();
>
> As a general rule using goto's usually means that there is something
> fundamentally broken with what you are doing. But enough of the
> soapbox.
>

Not a problem. I was using them as both a matter of style and
convenience. I'll forgo their use in future patches.

> Also I screwed up the author entry on the commit, but pushed an
> attribution commit afterwards to try and fix it. Sorry.
>
> --Jack
>

Thanks for taking the time to review and clean up the code, as well as
to explain your changes. I tested the changes with my mcheck test code
and was able to inject an MCE into an x86 kernel. The x86-64 case
still doesn't work; I'm not sure what is going on.
