I just noticed that one of my test boxes has a AMD processor so I took a
bit of a look into this.
The issue seems to be with this bit of code in the hypervisor
(xen/arch/x86/x86_64/entry.S):
restore_all_guest:
ASSERT_INTERRUPTS_DISABLED
RESTORE_ALL
testw $TRAP_syscall,4(%rsp)
jz iret_exit_to_guest
addq $8,%rsp
popq %rcx # RIP
popq %r11 # CS
cmpw $FLAT_USER_CS32,%r11
popq %r11 # RFLAGS
popq %rsp # RSP
je 1f
sysretq
1: sysretl
We are attempting to return to the Linux defined __USER_CS32 (0x23)
which does not match the test for the Xen defined FLAT_USER_CS32
(0xe023) and therefore we hit the sysretq instead of the sysretl which
causes us to return with CS 0xe033 (FLAT_USER_CS64) instead of CS
0xe023.
This patch to the kernel fixes things but doesn't seem that
satisfactory:
diff --git a/arch/x86/xen/xen-asm_64.S b/arch/x86/xen/xen-asm_64.S
index 02f496a..203586d 100644
--- a/arch/x86/xen/xen-asm_64.S
+++ b/arch/x86/xen/xen-asm_64.S
@@ -93,7 +93,7 @@ ENTRY(xen_sysret32)
pushq $__USER32_DS
pushq PER_CPU_VAR(old_rsp)
pushq %r11
- pushq $__USER32_CS
+ pushq $FLAT_USER_CS32
pushq %rcx
pushq $VGCF_in_syscall
Coming from the other angle we could fix this in the hypervisor by
always returning to guest (user or kernel) via iret instead of sysret:
diff -r e7a1eab70fac xen/arch/x86/x86_64/entry.S
--- a/xen/arch/x86/x86_64/entry.S Mon Nov 09 10:24:54 2009 +0000
+++ b/xen/arch/x86/x86_64/entry.S Mon Nov 23 15:15:39 2009 +0000
@@ -48,22 +48,6 @@
restore_all_guest:
ASSERT_INTERRUPTS_DISABLED
RESTORE_ALL
- testw $TRAP_syscall,4(%rsp)
- jz iret_exit_to_guest
-
- addq $8,%rsp
- popq %rcx # RIP
- popq %r11 # CS
- cmpw $FLAT_USER_CS32,%r11
- popq %r11 # RFLAGS
- popq %rsp # RSP
- je 1f
- sysretq
-1: sysretl
-
- ALIGN
-/* No special register assumptions. */
-iret_exit_to_guest:
addq $8,%rsp
.Lft0: iretq
I think much of the issue stems from Xen defining several segment
descriptors which are essentially equivalent to the ones Linux uses. It
seems a bit ugly to expose these Xen defined descriptors to the guest
when it hasn't explicitly asked for them. On the other hand I'm not sure
what can realistically do since doing the Right Thing would seem to
involve looking up the descriptor in the GDT to determine if the
selector is 32 or 64 bit and/or context switching IA32_STAR in some
fashion to allow guests to specify their own userspace CS for sysret 32
and 64.
Perhaps simply not returning guest userspace with sysret (as above)
makes most sense, a syscall already takes a trap through the hypervisor
on both entry and exit so I'm not sure the difference between sysret and
iret is going to be noticeable.
Another option might be to define VGCF_compat_mode as a new flag to
HYPERVISOR_iret and select sysretq/sysretl based on that. This would
still expose Xen descriptors to guests which didn't ask for one but at
least it would (probably) be a compatible descriptor.
> It does not happen on XenSource 2.6.18 kernel
I assume that this kernel (perhaps coincidentally) manages to use
FLAT_USER_CS32 for compat mode processes.
> , or the Debian 2.6.26 kernel.
This was a forward ported 2.6.18-style kernel so I guess the same reason
as 2.6.18.
Ian.
--
To UNSUBSCRIBE, email to debian-bugs-...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listm...@lists.debian.org
Well, the problem is that this code ignores what the AMD spec stats:
| Because a SYSCALLed operating system can be entered from either 64-bit
| mode or compatibility mode, the corresponding SYSRET must know the mode
| to which it must return. [...] In the service-routine entry point
| code, a flag can be set indicating which type of SYSRET is needed upon
| exiting the called routine.
The code actually have to know if it was called from 64 or compatibility
mode, not assume it. And it also say that you have to use sysret, and
not iret.
Bastian
--
I have never understood the female capacity to avoid a direct answer to
any question.
-- Spock, "This Side of Paradise", stardate 3417.3
Sounds correct. This is tricky for a hypervisor since we don't know the
mode of the guest user-mode processes which made the syscall. The guest
kernel does know this which is why I proposed an additional
VGCF_compat_mode flag.
> And it also say that you have to use sysret, and not iret.
I don't believe that is the case (the processor would have to carry some
state for the entire duration of a syscall for it to make any
difference). I think the spec simply assumes that an OS author would
want to use sysret if they used syscall.
Ian.
Don't the hypercalls already always go via iret?
- testw $TRAP_syscall,4(%rsp)
- jz iret_exit_to_guest
IOW if TRAP_syscall is not set (i.e. this is a hypercall not a syscall)
then exit via iret.
> I wonder though whether it wouldn't be possible to alter the TRAP_syscall
> value (stored when entering the hypervisor) in do_iret(), so that whatever
> do_iret() puts on the stack will be processed by iret.
That would make the VGCF_in_syscall passed to the iret hypercall
meaningless/useless?
>
> >> It does not happen on XenSource 2.6.18 kernel
> >
> >I assume that this kernel (perhaps coincidentally) manages to use
> >FLAT_USER_CS32 for compat mode processes.
> >
> >> , or the Debian 2.6.26 kernel.
> >
> >This was a forward ported 2.6.18-style kernel so I guess the same reason
> >as 2.6.18.
>
> If your analysis was right, 2.6.18 as well as our forward ported kernels
> should also be affected (both ia32_sysenter_target and ia32_cstar_target
> store __USER32_CS to the frame, and return via HYPERVISOR_iret), yet
> supposedly they don't have the problem (though I can't say why that
> would be). So perhaps there's some other yet un-described aspect to
> this, or I'm being confused by something...
I didn't try any of these kernels myself so I don't really know what
happens.
But this is not just the return-to-user-space path you're changing, but
also the hypercall one. You certainly don't want an iret in that case.
I wonder though whether it wouldn't be possible to alter the TRAP_syscall
value (stored when entering the hypervisor) in do_iret(), so that whatever
do_iret() puts on the stack will be processed by iret.
>> It does not happen on XenSource 2.6.18 kernel
>
>I assume that this kernel (perhaps coincidentally) manages to use
>FLAT_USER_CS32 for compat mode processes.
>
>> , or the Debian 2.6.26 kernel.
>
>This was a forward ported 2.6.18-style kernel so I guess the same reason
>as 2.6.18.
If your analysis was right, 2.6.18 as well as our forward ported kernels
should also be affected (both ia32_sysenter_target and ia32_cstar_target
store __USER32_CS to the frame, and return via HYPERVISOR_iret), yet
supposedly they don't have the problem (though I can't say why that
would be). So perhaps there's some other yet un-described aspect to
this, or I'm being confused by something...
Jan
Oh yes, I was confused into thinking it was the same as VGCF_in_syscall
for some reason.
Ian.
>
> -- Keir
Yeah.
> > And it also say that you have to use sysret, and not iret.
> I don't believe that is the case (the processor would have to carry some
> state for the entire duration of a syscall for it to make any
> difference). I think the spec simply assumes that an OS author would
> want to use sysret if they used syscall.
Well, it is documented this way. If you ignore it, it can work (and does
in this case) but is undefined behaviour.
Bastian
--
Bones: "The man's DEAD, Jim!"
>> But this is not just the return-to-user-space path you're changing, but
>> also the hypercall one. You certainly don't want an iret in that case.
>
> Don't the hypercalls already always go via iret?
> - testw $TRAP_syscall,4(%rsp)
> - jz iret_exit_to_guest
> IOW if TRAP_syscall is not set (i.e. this is a hypercall not a syscall)
> then exit via iret.
I think not -- here TRAP_syscall means 'entered Xen via SYSCALL
instruction', not 'entered to do a syscall'. TRAP_syscall should be set
regardless of whether the SYSCALL instruction was executed by guest userland
or guest kernel.
-- Keir
Ah, good detective work.
> This patch to the kernel fixes things but doesn't seem that
> satisfactory:
>
It is a bit ugly. I guess you could just assert that FLAT_USER_CS32 is
part of the iret ABI so the guest has to use it, which appears to be the
defacto definition. The downside is that usermode could observe that it
has a non-standard cs selector; however, the Linux ABI doesn't define
the selector values (and they're different in native 32 bit vs compat
anyway, I think).
That would be a bit awkward to do in the iret fast path.
> Perhaps simply not returning guest userspace with sysret (as above)
> makes most sense, a syscall already takes a trap through the hypervisor
> on both entry and exit so I'm not sure the difference between sysret and
> iret is going to be noticeable.
>
> Another option might be to define VGCF_compat_mode as a new flag to
> HYPERVISOR_iret and select sysretq/sysretl based on that. This would
> still expose Xen descriptors to guests which didn't ask for one but at
> least it would (probably) be a compatible descriptor.
>
I don't think that's much of an improvement over using the Xen selector
for cs. Of course, it requires that the Xen selectors are actually part
of the ABI and won't change at some later point.
J
Linux freely uses iret to return from syscall for things like fork and
exec. They are complimentary instructions, but nothing requires them to
be paired.
J
Yes, I suppose that is reasonable.
> > I'm not sure
> > what can realistically do since doing the Right Thing would seem to
> > involve looking up the descriptor in the GDT to determine if the
> > selector is 32 or 64 bit and/or context switching IA32_STAR in some
> > fashion to allow guests to specify their own userspace CS for sysret 32
> > and 64.
> >
>
> That would be a bit awkward to do in the iret fast path.
Agreed, hence "realistically".
>
> > Perhaps simply not returning guest userspace with sysret (as above)
> > makes most sense, a syscall already takes a trap through the hypervisor
> > on both entry and exit so I'm not sure the difference between sysret and
> > iret is going to be noticeable.
> >
> > Another option might be to define VGCF_compat_mode as a new flag to
> > HYPERVISOR_iret and select sysretq/sysretl based on that. This would
> > still expose Xen descriptors to guests which didn't ask for one but at
> > least it would (probably) be a compatible descriptor.
> >
>
> I don't think that's much of an improvement over using the Xen selector
> for cs. Of course, it requires that the Xen selectors are actually part
> of the ABI and won't change at some later point.
I think the guest accessible-but-Xen-defined descriptors are part of the
ABI, so that's OK.
Ian.
Okay, I think I spotted the relevant difference: 2.6.18 and forward ports
set VGCF_in_syscall only when returning from 64-bit system calls (through
ret_from_sys_call) - 32-bit syscalls (regardless of the entry path taken)
return through int_ret_from_sys_call. 32-bit guest kernels shouldn't be
affected by this, as compat mode returns from the hypervisor
(compat_restore_all_guest) always use iret.
Jan
I think dropping the VCGF_in_syscall flag is the simplest possible fix
then. There doesn't seem to be a huge benefit to using sysret in this
case. Does this look OK?
J
Subject: [PATCH] xen: use iret for return from 64b kernel to 32b usermode
If Xen wants to return to a 32b usermode with sysret it must use the
right form. When using VCGF_in_syscall to trigger this, it looks at
the code segment and does a 32b sysret if it is FLAT_USER_CS32.
However, this is different from __USER32_CS, so it fails to return
properly if we use the normal Linux segment.
So avoid the whole mess by dropping VCGF_in_syscall and simply use
plain iret to return to usermode.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fi...@citrix.com>
diff --git a/arch/x86/xen/xen-asm_64.S b/arch/x86/xen/xen-asm_64.S
index 02f496a..f681d55 100644
--- a/arch/x86/xen/xen-asm_64.S
+++ b/arch/x86/xen/xen-asm_64.S
@@ -96,7 +96,7 @@ ENTRY(xen_sysret32)
pushq $__USER32_CS
pushq %rcx
- pushq $VGCF_in_syscall
+ pushq $0
1: jmp hypercall_iret
ENDPATCH(xen_sysret32)
RELOC(xen_sysret32, 1b+1)
Yes, with one (more cosmetic than really useful) adjustment - the flag
should also be dropped from the !CONFIG_IA32_EMULATION code path
at the end of the file.
In any case,
Acked-by: Jan Beulich <jbeu...@novell.com>
Jan
Looks OK and works for me in practice too.
Ian.
>
> J
>
> Subject: [PATCH] xen: use iret for return from 64b kernel to 32b usermode
>
> If Xen wants to return to a 32b usermode with sysret it must use the
> right form. When using VCGF_in_syscall to trigger this, it looks at
> the code segment and does a 32b sysret if it is FLAT_USER_CS32.
> However, this is different from __USER32_CS, so it fails to return
> properly if we use the normal Linux segment.
>
> So avoid the whole mess by dropping VCGF_in_syscall and simply use
> plain iret to return to usermode.
>
> Signed-off-by: Jeremy Fitzhardinge <jeremy.fi...@citrix.com>
Tested-by: Ian Campbell <ian.ca...@citrix.com>