2.3.30pre1 syscall w/6 args support?

Artur Skawina

Dec 8, 1999, 3:00:00 AM

to Linus Torvalds

Is this really absolutely necessary? I have SYSENTER based syscalls
mostly working, cutting the cost of a system call in half (currently
from 281 cycles for a int80 based getpid to only 137, but the numbers
will change a little as i finish it off). Supporting six args is not
really an option however (not w/o extra kernel side translations --
there are simply no registers left to pass the required info
(esp/eip))..

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

Linus Torvalds

Dec 8, 1999, 3:00:00 AM

to Artur Skawina

On Wed, 8 Dec 1999, Artur Skawina wrote:
>
> Is this really absolutely necessary? I have SYSENTER based syscalls
> mostly working, cutting the cost of a system call in half (currently
> from 281 cycles for a int80 based getpid to only 137, but the numbers
> will change a little as i finish it off). Supporting six args is not
> really an option however (not w/o extra kernel side translations --
> there are simply no registers left to pass the required info
> (esp/eip))..

Actually, if you require passing of eip, you're almost certainly doing
something wrong.

The thing about eip is that you can (and should) be able to run the same
library and binaries on both a kernel with and without SYSENTER. Without
having to test for SYSENTER in user mode and do different library things.

And THAT in turn means that to do SYSENTER right, the kernel should map a
magic page into each process' address space (use the FIXMAP capability -
we're already using that to map things like the IOAPIC at a fixed
address), and you do a system call simply by doing a "call" to that magic
page: the kernel can set up whatever it wants (be it "int 0x80" for
old-style system calls, SYSENTER for new system calls, or some magic trap
for Merced - you get the idea) at that address.

And that in turn means that you don't need to save the return EIP for
SYSENTER, because the only way to get the EIP anyway is just from the
stack that was required for the call to the magic address - the actual
address of the "sysenter" instruction is not even interesting.

In fact, I would argue that the proper way to handle this is:

- no sysenter capability on the CPU: use "int 0x80":

    magic_address:
        movl 4(%esp),%ebx
        movl 8(%esp),%ecx
        movl 12(%esp),%edx
        movl 16(%esp),%esi
        movl 20(%esp),%edi
        movl 24(%esp),%ebp
        int $0x80
        ret

- sysenter:

    magic_address:
        movl %esp,%ebx
        sysenter

and in both cases the magic address is just called with:

- arguments on the stack in user mode
- %eax contains the system call number

and in the sysenter case you'll just end up having to fetch the arguments
from the stack (you need to touch the stack anyway in order to get the
return address).

At least with the above, we can handle _any_ future calling convention.

Linus
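
The magic-page idea in the message above can be modelled in plain C. This is only a user-space sketch, not kernel code: the names `magic_entry_fn`, `fake_kernel`, `entry_int80`, `entry_sysenter` and `select_entry` are hypothetical, and the "kernel" is a toy function that sums its inputs. The point it illustrates is that the caller always goes through one entry pointer and never needs to know which mechanism sits behind it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: the kernel installs one entry stub at a fixed
 * address; user code only ever calls through that address. */
typedef long (*magic_entry_fn)(long nr, const long args[6]);

/* Toy stand-in for the kernel: "executes" a call by summing inputs. */
static long fake_kernel(long nr, const long regs[6])
{
    long sum = nr;
    for (int i = 0; i < 6; i++)
        sum += regs[i];
    return sum;
}

/* Models the int $0x80 stub: copy all six stack slots into registers. */
static long entry_int80(long nr, const long args[6])
{
    long regs[6];
    for (int i = 0; i < 6; i++)
        regs[i] = args[i];              /* movl N(%esp),%reg */
    return fake_kernel(nr, regs);       /* int $0x80 */
}

/* Models the sysenter stub: pass the stack pointer, kernel reads args. */
static long entry_sysenter(long nr, const long args[6])
{
    const long *user_stack = args;      /* movl %esp,%ebx */
    return fake_kernel(nr, user_stack); /* sysenter */
}

/* Models the kernel choosing what to place at the magic address. */
static magic_entry_fn select_entry(int cpu_has_sysenter)
{
    return cpu_has_sysenter ? entry_sysenter : entry_int80;
}

/* The caller is identical whichever stub was installed. */
static long do_syscall(magic_entry_fn magic, long nr, const long args[6])
{
    assert(magic != NULL);
    return magic(nr, args);
}
```

Calling `do_syscall(select_entry(0), ...)` and `do_syscall(select_entry(1), ...)` with the same arguments gives the same result, which is the property the proposal relies on.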

Ulrich Drepper

Dec 8, 1999, 3:00:00 AM

to Linus Torvalds

Linus Torvalds <torv...@transmeta.com> writes:

> In fact, I would argue that the proper way to handle this is:
>
> - no sysenter capability on the CPU: use "int 0x80":
>
> magic_address:
> movl 4(%esp),%ebx
> movl 8(%esp),%ecx
> movl 12(%esp),%edx
> movl 16(%esp),%esi
> movl 20(%esp),%edi
> movl 24(%esp),%ebp
> int $0x80
> ret

I don't like this at all and it is really unnecessary.

First, it's easy enough to recompile glibc if there is a new calling
convention. One only has to change one definition in the sources and
recompile.

And second, mentioning the sources, take a look at
sysdeps/unix/sysv/linux/i386/sysdep.h. The macros to create the
syscall code go to great lengths to generate optimal code. I.e., no
additional arguments pushed etc.

IF/When sysenter comes it's easy enough to provide an alternative
sysdep.h version which has the definition for the processors which
support them. You can even have libraries with and without this
system call mechanism installed at the same time (e.g., when the
directory is exported via NFS) and the ld.so will pick the right one.

You are only wasting cycles without gaining anything which is not
possible otherwise as well.

--
---------------. drepper at gnu.org ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Cygnus Solutions `--' drepper at cygnus.com `------------------------

Andi Kleen

Dec 8, 1999, 3:00:00 AM

to Linus Torvalds

Linus Torvalds <torv...@transmeta.com> writes:
>
> In fact, I would argue that the proper way to handle this is:
>
> - no sysenter capability on the CPU: use "int 0x80":
>
> magic_address:
> movl 4(%esp),%ebx
> movl 8(%esp),%ecx
> movl 12(%esp),%edx
> movl 16(%esp),%esi
> movl 20(%esp),%edi
> movl 24(%esp),%ebp
> int $0x80
> ret

This just destroyed all registers, so the syscall stub would need to save
them redundantly (entry.S does it anyways). With multiple magic addresses
per syscall-with-N-arguments it gets more ugly.

The obvious micro optimization of reversing the order of the register loads
and using magic_address+(6-NARGS)*4 would probably not scale too well to
IA64 magic.

Also it costs you one TLB entry.

-Andi
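
The micro-optimization mentioned above (reversing the register loads and entering at `magic_address+(6-NARGS)*4`) comes down to one line of arithmetic. A small sketch, assuming each `movl disp8(%esp),%reg` load encodes to 4 bytes, which is what the `*4` in the formula implies; the names `entry_offset`, `LOAD_INSN_BYTES` and `MAX_SYSCALL_ARGS` are made up for illustration:

```c
#include <assert.h>

/* Assumption: each "movl disp8(%esp),%reg" load encodes to 4 bytes. */
enum { LOAD_INSN_BYTES = 4, MAX_SYSCALL_ARGS = 6 };

/* With the loads emitted in reverse order (argument 6 first, argument 1
 * last), a caller with nargs arguments skips the loads it does not need
 * by entering the stub this many bytes past magic_address. */
static unsigned entry_offset(unsigned nargs)
{
    assert(nargs <= MAX_SYSCALL_ARGS);
    return (MAX_SYSCALL_ARGS - nargs) * LOAD_INSN_BYTES;
}
```

A six-argument call enters at offset 0 and runs every load; a zero-argument call enters at offset 24, directly at the trap instruction. The scheme depends on fixed-size load instructions, which is why it would not carry over cleanly to a different architecture's stub.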

Linus Torvalds

Dec 8, 1999, 3:00:00 AM

to Ulrich Drepper

On 7 Dec 1999, Ulrich Drepper wrote:
>
> First, it's easy enough to recompile glibc if there is a new calling
> convention. One only has to change one definition in the sources and
> recompile.

Wrong.

Anybody who thinks that having two different libc's on different machines
is acceptable is so out of touch with reality that it is scary.

> And second, mentioning the sources, take a look at
> sysdeps/unix/sysv/linux/i386/sysdep.h. The macros to create the
> syscall code go to great lengths to generate optimal code. I.e., no
> additional arguments pushed etc.

Look at the example. It doesn't push any additional arguments either. It
may _read_ arguments that aren't there, but the overhead of that is about
a cycle.

Linus

Richard Gooch

Dec 8, 1999, 3:00:00 AM

to Ulrich Drepper

Ulrich Drepper writes:
> Linus Torvalds <torv...@transmeta.com> writes:
>
> > In fact, I would argue that the proper way to handle this is:
> >
> > - no sysenter capability on the CPU: use "int 0x80":
> >
> > magic_address:
> > movl 4(%esp),%ebx
> > movl 8(%esp),%ecx
> > movl 12(%esp),%edx
> > movl 16(%esp),%esi
> > movl 20(%esp),%edi
> > movl 24(%esp),%ebp
> > int $0x80
> > ret
>

> I don't like this at all and it is really unnecessary.
>

> First, it's easy enough to recompile glibc if there is a new calling
> convention. One only has to change one definition in the sources and
> recompile.

I don't want to have to patch/hack libc and recompile just to use the
better syscall interface.

Imagine suddenly someone starts experimenting with a variety of
techniques and puts out kernel patches. I wouldn't want to have to
track those patches in libc as well. Too much work. I wouldn't bother
testing the kernel patches.

> IF/When sysenter comes it's easy enough to provide an alternative
> sysdep.h version which has the definition for the processors which
> support them. You can even have libraries with and without this
> system call mechanism installed at the same time (e.g., when the
> directory is exported via NFS) and the ld.so will pick the right one.

This sounds fragile.

Regards,

Richard....
Permanent: rgo...@atnf.csiro.au
Current: rgo...@ras.ucalgary.ca

Ulrich Drepper

Dec 8, 1999, 3:00:00 AM

to Linus Torvalds

Linus Torvalds <torv...@transmeta.com> writes:

> Anybody who thinks that having two different libc's on different machines
> is acceptable is so out of touch with reality that it is scary.

In your always friendly way of explaining your opinions (== laws) you
completely miss that there are/will be different versions anyhow.
Some people want to use other features (let's say, FPU, MMX) which are
not generally available and so there are different libraries. This is
not something new which is coming up with the syscalls. They simply
add another dimension which is absolutely no problem to manage.

And also: look what a syscall implementation would have to do. There
is still a syscall wrapper necessary since your magic code does not do
proper error handling. You have to

- get the arguments from the stack
- push them immediately again

(these two steps can only be avoided if you have a syscall with fewer
than two arguments; in that case you can temporarily store the
return address in %edx, and %ebx in %ecx)

- call the magic function
- correct the stack to remove the parameters
- handle errors
- return

So besides your unnecessary memory reads in the magic function, you
also have to read and write all parameters in the syscall wrapper and then
correct the stack pointer afterwards (which is a quite expensive
operation if you want to work with the stack pointer immediately
afterwards).

--
---------------. drepper at gnu.org ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Cygnus Solutions `--' drepper at cygnus.com `------------------------
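
The "handle errors" step in the wrapper list above is the part the magic stub does not cover. A minimal sketch of that step, assuming the Linux convention that a failing system call returns `-errno` in the range -4095..-1; the function name `syscall_result` is hypothetical and this is not glibc's actual macro:

```c
#include <assert.h>
#include <errno.h>

/* Assumption: failures come back as -errno in the range [-4095, -1];
 * anything else is a successful return value. */
static long syscall_result(long raw)
{
    if ((unsigned long)raw >= (unsigned long)-4095L) {
        /* Invariant implied by the range check above. */
        assert(-raw >= 1 && -raw <= 4095);
        errno = (int)-raw;
        return -1;
    }
    return raw;
}
```

The unsigned comparison folds the two-sided range test into a single compare, so the error check adds very little to a wrapper regardless of which entry mechanism is used underneath.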

Artur Skawina

Dec 8, 1999, 3:00:00 AM

to Linus Torvalds

Linus Torvalds wrote:
>
> > Is this really absolutely necessary? I have SYSENTER based syscalls
> > mostly working, cutting the cost of a system call in half (currently
> > from 281 cycles for a int80 based getpid to only 137, but the numbers
> > will change a little as i finish it off). Supporting six args is not
> > really an option however (not w/o extra kernel side translations --
> > there are simply no registers left to pass the required info
> > (esp/eip))..
>
> Actually, if you require passing of eip, you're almost certainly doing
> something wrong.

well, while trying to get it to work i've often wondered what the
person who came up with the sysenter/sysexit scheme at intel had been
sniffing... [for people who might not be familiar with sysenter:
passing of eip/esp is necessary because sysenter doesn't save it, this is
true always, even with Linus' scheme; the only choice we have is _how_
to do that, and there's no perfect solution]

> The thing about eip is that you can (and should) be able to run the same
> library and binaries on both a kernel with and without SYSENTER. Without
> having to test for SYSENTER in user mode and do different library things.

the choice becomes: do you want the libc to do this (presumably just one
test on bootup which selects the proper stubs to use) or do you want the
kernel to do very much the same (by providing the stubs, or as you
suggested some kind of intermediate single C entry point, which isn't a
good solution either)?

> And THAT in turn means that to do SYSENTER right, the kernel should map a
> magic page into each process' address space (use the FIXMAP capability -
> we're already using that to map things like the IOAPIC at a fixed
> address), and you do a system call simply by doing a "call" to that magic
> page: the kernel can set up whatever it wants (be it "int 0x80" for
> old-style system calls, SYSENTER for new system calls, or some magic trap
> for Merced - you get the idea) at that address.

I did consider this. The reasons i didn't do it that way were (1) to
make it as nonintrusive as possible (the patch currently is just a few
dozen lines) and (2) policy -- adding hardwired userspace mappings.
(the performance difference would be ~ a few cycles, so that's not a
big issue). Now, if (1) isn't a problem i may consider that scheme again.
Assuming nobody sees any problems w/ (2) above?

> And that in turn means that you don't need to save the return EIP for
> SYSENTER, because the only way to get the EIP anyway is just from the
> stack that was required for the call to the magic address - the actual
> address of the "sysenter" instruction is not even interesting.

Until now i've actually ignored all 5 arg syscalls, as the int80
entry point is always there and the gain is smaller the more
complex the syscall is; so i didn't have to do even that. But i had
planned on changing that -- this is one reason i said the numbers
given above are going to change a bit.

> In fact, I would argue that the proper way to handle this is:
>
> - no sysenter capability on the CPU: use "int 0x80":
>
> magic_address:
> movl 4(%esp),%ebx
> movl 8(%esp),%ecx
> movl 12(%esp),%edx
> movl 16(%esp),%esi
> movl 20(%esp),%edi
> movl 24(%esp),%ebp
> int $0x80
> ret
>

> - sysenter:
>
> magic_address:
> movl %esp,%ebx
> sysenter
>
> and in both cases the magic address is just called with:
>
> - arguments on the stack in user mode
> - %eax contains the system call number
>
> and in the sysenter case you'll just end up having to fetch the arguments
> from the stack (you need to touch the stack anyway in order to get the
> return address).
>
> At least with the above, we can handle _any_ future calling convention.

Yep. In theory this seems perfect. Other people have already pointed out some
obvious issues, there are probably more; most are likely solvable, but could
add significantly to the complexity. [there are some less obvious issues like
you can not, for various reasons, blindly use the sysenter/sysexit pair
(eg you can not return with sysexit from execve) and in some cases i found
int80 to be faster (eg signal handling, when restarting syscalls)]

> Anybody who thinks that having two different libc's on different machines
> is acceptable is so out of touch with reality that it is scary.

This has been true for quite a while, having an optimized libc for
your architecture makes sense. If you need to share libraries you
just pick the least common denominator (and pay the price in performance,
but in reality that's usually very low).

Not that i think that this is relevant in this case -- the question is:
is there anything that the kernel can do that could not be done equally
well in libc? Why shouldn't libc be setting up that special magic entry
point all by itself?

[i'm obviously a bit biased since i already have sysenter working w/o
the kernel mapped stubs, but wouldn't have a problem switching to
another solution, if i'd be convinced it would be beneficial.
SYSENTER/EXIT has its problems, but the numbers speak for themselves.
A few examples:

                        INT80     SYSENTER
  lat_syscall null      0.6814    0.3394
  lat_syscall read      1.0477    0.6747
  lat_syscall write     0.8422    0.4937
  lat_fifo              7.4193    6.3050
  lat_pipe              7.4369    6.3880
  lat_unix             14.9843   14.1488

  time dd if=/dev/zero of=/dev/null bs=1 count=10000000
                        0:20.21   0:13.00
]
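
To put the figures above in proportion, the ratio INT80/SYSENTER can be worked out per row. The sketch below copies the latency numbers from the message; the struct and function names (`lat_row`, `speedup`) are made up for illustration, and nothing here re-measures anything.

```c
#include <assert.h>
#include <stddef.h>

struct lat_row {
    const char *name;
    double int80;     /* latency with int $0x80, as reported above */
    double sysenter;  /* latency with SYSENTER, as reported above */
};

/* Figures copied from the message above. */
static const struct lat_row rows[] = {
    {"lat_syscall null",   0.6814,  0.3394},
    {"lat_syscall read",   1.0477,  0.6747},
    {"lat_syscall write",  0.8422,  0.4937},
    {"lat_fifo",           7.4193,  6.3050},
    {"lat_pipe",           7.4369,  6.3880},
    {"lat_unix",          14.9843, 14.1488},
};

/* Ratio of old to new latency: 2.0 means the call takes half as long. */
static double speedup(const struct lat_row *r)
{
    assert(r != NULL && r->sysenter > 0.0);
    return r->int80 / r->sysenter;
}
```

The null call comes out at roughly 2.0x, read and write at roughly 1.6x and 1.7x, and the fifo, pipe and unix-socket rows at under 1.2x, which matches the remark earlier in the message that the gain shrinks as the system call itself gets more complex.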

Linus Torvalds

Dec 8, 1999, 3:00:00 AM

to Artur Skawina

On Wed, 8 Dec 1999, Artur Skawina wrote:
>

> well, while trying to get it to work i've often wondered what the
> person who came up with the sysenter/sysexit scheme at intel had been
> sniffing...

I agree. I don't understand why they couldn't just have saved the old
eip/esp in another msr or something.

> the choice becomes: do you want the libc to do this (presumably just one
> test on bootup which selects the proper stubs to use) or do you want the
> kernel to do very much the same (by providing the stubs (or as you
> suggested some kind of intermediate single C entry point, which isn't a
> good solution either).

It wouldn't be a C entry-point: it would really be an assembly
entry-point, and you'd have special calling conventions for it (for one
thing, you'd have to set up %eax to be the system call number).

As far as the kernel would be concerned, it would be just another of the
"high virtual memory mappings" - the kernel already uses them internally
simply because it is a good idea to avoid a pointer lookup for some things
that are so common that you can more efficiently cache them in the TLB. So
the kernel has certain magical addresses where the IO-APIC is always at one
fixed virtual address instead of having to be looked up on every access to
it.

So this would be just another such "virtually pinned" page, except it
would be readable from user space too (but obviously not writable). It can
contain any number of trampolines and/or other data (although I don't
really see what static data would ever be that timing-critical).

> I did consider this. The reasons i didn't do it that way were (1) to
> make it as nonintrusive as possible (the patch currently is just a few
> dozen lines) and (2) policy -- adding hardwired userspace mappings.
> (the performance difference would be ~ a few cycles, so that's not a
> big issue). Now, if (1) isn't a problem i may consider that scheme again.
> Assuming nobody sees any problems w/ (2) above?

Note that we definitely don't want to eat up virtual space in the
"traditional" user space area - 0x00000000 - 0xC0000000. The pinned down
thing would be a kernel mapping, basically a vmalloc()-like thing except
normal vmallocs are not accessible from user space (the page tables have
been set up to not allow access to them for obvious reasons).

(And no, I'm not really suggesting you use vmalloc() - I'm really
suggesting you look at the stuff in <asm/fixmap.h> which gets set up at
predefined addresses rather than the more dynamic vmalloc()).

> Yep. In theory this seems perfect. Other people have already pointed out some
> obvious issues, there are probably more; most are likely solvable, but could
> add significantly to the complexity. [there are some less obvious issues like
> you can not, for various reasons, blindly use the sysenter/sysexit pair
> (eg you can not return with sysexit from execve) and in some cases i found
> int80 to be faster (eg signal handling, when restarting syscalls)]

sysenter is really a strange thing. Have you verified that your current
code works with vm86 mode programs like dosemu or with wine, for example?

I can see why intel wanted to implement sysenter/sysexit, but it IS a
rather strange way of doing what they did. Not very flexible.

> Not that i think that this is relevant in this case -- the question is:
> is there anything that the kernel can do that could not be done equally
> well in libc? Why shouldn't libc be setting up that special magic entry
> point all by itself?

Because libc will do it every single time a process gets started.

I'm a latency person. I've tried to talk to people about pre-linking libc
at a fixed address, and avoid all the horrible run-time linking for the
case when the pre-linked address is available. So far nobody has done
this on Linux, and it makes our process startup slower.

I don't want to make it worse.

Linus

Artur Skawina

Dec 8, 1999, 3:00:00 AM

to Linus Torvalds

Linus Torvalds wrote:
>
> I agree. I don't understand why they couldn't just have saved the old
> eip/esp in another msr or something.

And as if that weren't enough, once you start using sysenter all the
other seemingly minor details start to bite you (like it loses not only
VM but also the IF flag, so now every syscall turns on interrupts...
I guess i'll have to declare that a feature :) )

> It wouldn't be a C entry-point: it would really be an assembly
> entry-point, and you'd have special calling conventions for it (for one
> thing, you'd have to set up %eax to be the system call number).

I am not convinced doing the magic entry point will be a win; i
certainly agree with Ulrich that the libc solution is more attractive.
But we'll get the global fixmapped region anyway, so there's a chance
we will be able to compare both.

> As far as the kernel would be concerned, it would be just another of the
> "high virtual memory mappings" - the kernel already uses them internally
> simply because it is a good idea to avoid a pointer lookup for some things
> that are so common that you can more efficiently cache them in the TLB. So
> the kernel has certain magical addresses where the IO-APIC is always at one
> fixed virtual address instead of having to be looked up on every access to
> it.
>
> So this would be just another such "virtually pinned" page, except it
> would be readable from user space too (but obviously not writable). It can
> contain any number of trampolines and/or other data (although I don't
> really see what static data would ever be that timing-critical).

hmm, rdtsc based gettimeofday?

> > I did consider this. The reasons i didn't do it that way were (1) to
> > make it as nonintrusive as possible (the patch currently is just a few
> > dozen lines) and (2) policy -- adding hardwired userspace mappings.
> > (the performance difference would be ~ a few cycles, so that's not a
> > big issue). Now, if (1) isn't a problem i may consider that scheme again.
> > Assuming nobody sees any problems w/ (2) above?
>
> Note that we definitely don't want to eat up virtual space in the
> "traditional" user space area - 0x00000000 - 0xC0000000. The pinned down
> thing would be a kernel mapping, basically a vmalloc()-like thing except
> normal vmallocs are not accessible from user space (the page tables have
> been set up to not allow access to them for obvious reasons).
>
> (And no, I'm not really suggesting you use vmalloc() - I'm really
> suggesting you look at the stuff in <asm/fixmap.h> which gets set up at
> predefined addresses rather than the more dynamic vmalloc()).

i did look after the first time you mentioned it.
i'll try playing with that.

> sysenter is really a strange thing. Have you verified that your current
> code works with vm86 mode programs like dosemu or with wine, for example?

No, it does not. I haven't even written the code handling that case, for
the very simple reason i haven't yet figured out _what_ to do when a
vm86 task executes SYSENTER. This is one of only two unresolved issues
left. As it is impossible to recover from this (w/o a cooperative
userspace, that is. sysenter drops eip/esp/cs/ss) what should happen?
[i don't use any vm86 tasks so i'd find it acceptable to simply _exit(),
but there might be a better way... Ideas?]

Oh, there's something else i have on the todo list, but haven't yet
looked into -- SYSENTER doesn't clear TF.

Brian Gerst

Dec 8, 1999, 3:00:00 AM

to Linux kernel mailing list

Linus Torvalds wrote:
> At least with the above, we can handle _any_ future calling convention.

Also keep in mind the SYSCALL/SYSRET capability on K6 (and probably
Athlon) CPUs.

--

Brian Gerst

Olaf Titz

Dec 8, 1999, 3:00:00 AM

to linux-...@vger.rutgers.edu

In article <7UUFA...@khms.westfalen.de> you write:
> torv...@transmeta.com (Linus Torvalds) wrote on 07.12.99 in

> > contain any number of trampolines and/or other data (although I don't
> > really see what static data would ever be that timing-critical).

> Not static data, no. But think about putting some dynamic data there.
> Like, say, some version of user-space readable jiffies. There are some

The obvious candidates would be all those items currently read by
simple syscalls like gettimeofday(), time(), getpid(), getppid(),
getuid(), getgid(), etc. so those operations (where the syscall
overhead strikes most) become simple reads (with an adapted libc).
This would especially rock getpid()-based benchmarks ;-) but also good
for X11 applications, etc. which frequently call gettimeofday().

Perhaps it would even be beneficial to map (parts of?) the task
structure there.

Olaf
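
The suggestion above, turning calls like getpid() into plain reads from a kernel-maintained page, could look roughly like this from the libc side. Everything here is a hypothetical model: `struct global_page`, its fields, `kernel_fill_page` and `fast_getpid` are invented names, and the "page" is ordinary process memory filled in by a stand-in function instead of a real kernel mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical layout of a kernel-maintained, user-readable page. */
struct global_page {
    int   valid;  /* nonzero once the page has been filled in */
    pid_t pid;
    uid_t uid;
};

/* Stand-in for the kernel side: publish the current process's IDs. */
static void kernel_fill_page(struct global_page *page)
{
    assert(page != NULL);
    page->pid = getpid();
    page->uid = getuid();
    page->valid = 1;
}

/* libc-side getpid(): a memory read when the page is usable,
 * the ordinary system call otherwise. */
static pid_t fast_getpid(const struct global_page *page)
{
    if (page != NULL && page->valid)
        return page->pid;
    return getpid();
}
```

The fallback branch matters: a libc built this way still behaves correctly on a kernel that never provides the page, which keeps one binary usable on both old and new kernels.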

Linus Torvalds

Dec 8, 1999, 3:00:00 AM

to Artur Skawina

On Wed, 8 Dec 1999, Artur Skawina wrote:

> > contain any number of trampolines and/or other data (although I don't
> > really see what static data would ever be that timing-critical).
>

> hmm, rdtsc based gettimeofday?

There's a ton of details like this that are worth exploring,
but that right now are not worth it because we don't have the
infrastructure.

gettimeofday() is a great example. We don't just want to assume "rdtsc" in
user space, and even if we did we wouldn't want to re-calibrate all the
time. But yes, it would work very nicely indeed with the global area
approach (and when the CPU doesn't have rdtsc, the global area would just
do an old-fashioned system call).

A "node ID" may be another global static thing that would be useful in the
future, but that's certainly not an issue today.

> > sysenter is really a strange thing. Have you verified that your current
> > code works with vm86 mode programs like dosemu or with wine, for example?
>
> No, it does not. I haven't even written the code handling that case, for
> the very simple reason i haven't yet figured out _what_ to do when a
> vm86 task executes SYSENTER.

Oh, you can't do that. SYSENTER just loses all the information, so you
don't really have much choice: you can't get it right as far as I can
tell.

I was more thinking about the case where SYSENTER was used for the vm86()
system call, and you'd have to be careful _not_ to use SYSEXIT because you
can't use SYSEXIT to return to vm86 mode - you have to use the same old
"iret" for that case. That shouldn't be too hard: just check the flags on
the return path and if the flags imply a return to VM86 you just use the
old path.

The same is true of Wine: you just need to check the DS/SS segment values
on return (and if they are anything but USER_DS you need to use "iret"
again).

You may have all this code already, I was just wondering. It doesn't look
like rocket science, but it _does_ look like there's just a lot of details
that need to be just right...

> This is one of only two unresolved issues
> left. As it is impossible to recover from this (w/o a cooperative
> userspace, that is. sysenter drops eip/esp/cs/ss) what should happen?

The thing is, you don't even have enough information to know whether the
user was co-operative or not. A "bad" use of SYSENTER will look exactly
like a good one, so you really cannot tell. As such, I wouldn't worry
about the issue, because there is nothing you can possibly do about it
anyway.

Silly SYSENTER semantics.

> [i don't use any vm86 tasks so i'd find it acceptable to simply _exit(),
> but there might be a better way... Ideas?]

How would you know to _exit()? You'll just have to think that it's a real
system call, and then on the return path the process will probably receive
a SIGSEGV because the stack is crap etc.. But that's ok - just another bad
pointer dereference..

> Oh, there's something else i have on the todo list, but haven't yet
> looked into -- SYSENTER doesn't clear TF.

Does it matter?

Linus

Ingo Molnar

Dec 8, 1999, 3:00:00 AM

to Linus Torvalds

On Wed, 8 Dec 1999, Linus Torvalds wrote:

> I was more thinking about the case where SYSENTER was used for the vm86()
> system call, and you'd have to be careful _not_ to use SYSEXIT because you
> can't use SYSEXIT to return to vm86 mode - you have to use the same old
> "iret" for that case. That shouldn't be too hard: just check the flags on
> the return path and if the flags imply a return to VM86 you just use the
> old path.

i believe the best solution is to not call sys_vm86() with SYSENTER. The
reason is to not slow down all the other system calls with an extra check
for VM_MASK.

(it will still be possible to call sys_vm86() with SYSENTER and it cannot
crash the system - but it will not have the intended effect.) Same holds
for all other flag-modifying system calls: eg. iopl(). These are not
high-speed system calls anyway, so i dont think we should care.

> The same is true of Wine: you just need to check the DS/SS segment values
> on return (and if they are anything but USER_DS you need to use "iret"
> again).

whoops, nasty. I'm not quite sure how we could do this though - SYSENTER
destroys CS and SS, so we have nothing to check ... Probably Wine has to
use the old int $80 interface?

> > Oh, there's something else i have on the todo list, but haven't yet
> > looked into -- SYSENTER doesn't clear TF.
>
> Does it matter?

it doesnt matter on 2.3.

-- mingo

Ingo Molnar

Dec 8, 1999, 3:00:00 AM

to Linus Torvalds

> > The same is true of Wine: you just need to check the DS/SS segment values
> > on return (and if they are anything but USER_DS you need to use "iret"
> > again).
>
> whoops, nasty. I'm not quite sure how we could do this though - SYSENTER
> destroys CS and SS, so we have nothing to check ... Probably Wine has to
> use the old int $80 interface?

hm, a solution could be to simply save %ds/%ss within Wine before doing
the SYSENTER stuff. This has to be hashed out with glibc i guess (normal
glibc users do not want this), but should solve the problem, no?
saving/restoring %ds/%ss is still much faster than doing int $80.

Linus Torvalds

Dec 8, 1999, 3:00:00 AM

to Ingo Molnar

On Wed, 8 Dec 1999, Ingo Molnar wrote:
>
> i believe the best solution is to not call sys_vm86() with SYSENTER. The
> reason is to not slow down all the other system calls with an extra check
> for VM_MASK.

Well... You'll also have to cover stuff like "sigreturn" etc, which _can_
be quite timing critical.

And there's also the issue of non-system-calls. Things like the return
from the page fault handler could very well be speeded up using SYSEXIT,
by just making SYSEXIT part of the regular return sequence. Basically,
change the "restore_all" code in arch/i386/kernel/entry.S to dynamically
switch between "iret" and "SYSEXIT" - then _every_ normal kernel exit gets
speeded up.

No special cases. You lose by having three comparisons: you need to check
DS, SS and eflags. But that's on the order of a few cycles, and you win
by using SYSEXIT whenever possible - including interrupts, page faults,
etc.

I agree that you could _additionally_ have the simple-fast-system-call
case, but I wonder if the few cycles you save by avoiding the comparison
is even worth it.

Linus
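
The three comparisons described in the message above (DS, SS and eflags) can be written out as a single predicate. This is a sketch in C, not the assembly that would live in entry.S, and the constants are assumptions: bit 17 of EFLAGS is the VM flag, and 0x2B is taken here as the i386 user data selector. `struct saved_regs` and `can_return_with_sysexit` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values, for illustration only. */
#define VM_MASK  0x00020000u   /* EFLAGS.VM, bit 17 */
#define USER_DS  0x2Bu         /* assumed flat user data selector */

/* Hypothetical subset of the register state saved on kernel entry. */
struct saved_regs {
    uint32_t eflags;
    uint16_t ds;
    uint16_t ss;
};

/* True when the fast return path is usable; otherwise fall back to iret. */
static bool can_return_with_sysexit(const struct saved_regs *regs)
{
    assert(regs != NULL);
    if (regs->eflags & VM_MASK)   /* returning to vm86 mode */
        return false;
    if (regs->ds != USER_DS)      /* non-flat data segment, e.g. Wine */
        return false;
    if (regs->ss != USER_DS)      /* non-flat stack segment */
        return false;
    return true;
}
```

An ordinary flat-model process passes all three tests; a vm86 task or a thread running on a custom segment fails one of them and takes the old iret path.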

Brian Gerst

Dec 8, 1999, 3:00:00 AM

to Ingo Molnar

Ingo Molnar wrote:
> > The same is true of Wine: you just need to check the DS/SS segment values
> > on return (and if they are anything but USER_DS you need to use "iret"
> > again).
>
> whoops, nasty. I'm not quite sure how we could do this though - SYSENTER
> destroys CS and SS, so we have nothing to check ... Probably Wine has to
> use the old int $80 interface?

As far as I know, Wine currently only uses the extra segments for
Windows code. Any Linux syscalls should only be coming from the main
code segment.

I don't know about Intel's SYSENTER/SYSEXIT, but with AMD's
SYSCALL/SYSRET you cannot return to user space with iret after a
SYSCALL. The CPU sets an internal flag that causes a GPF when anything
modifies %cs except for SYSRET or an interrupt (hardware or software).
This has been a big stumbling block for me with adding SYSCALL support
because it interferes with task switching (ie. can't switch to another
task that would do an iret), unless it is restricted to syscalls that
cannot sleep.

--

Brian Gerst

Ingo Molnar

Dec 8, 1999, 3:00:00 AM

to Brian Gerst

On Wed, 8 Dec 1999, Brian Gerst wrote:

> As far as I know, Wine currently only uses the extra segments for
> Windows code. Any Linux syscalls should only be coming from the main
> code segment.

that is great. No changed %ds, %es, %ss either when glibc is called?

> I don't know about Intel's SYSENTER/SYSEXIT, but with AMD's
> SYSCALL/SYSRET you cannot return to user space with iret after a
> SYSCALL. The CPU sets an internal flag that causes a GPF when
> anything modifies %cs except for SYSRET or an interrupt (hardware or
> software). This has been a big stumbling block for me with adding
> SYSCALL support because it interferes with task switching (ie. can't
> switch to another task that would do an iret), unless it is restricted
> to syscalls that cannot sleep.

i've just tested this on Xeon CPUs, and it works. (i forced the fastcall
entry point to reschedule unconditionally, the system still uses int $80)

I agree that this would be a serious showstopper. Are you sure you werent
bitten by the stack-MSR issue? (we have to reload the SYSENTER kernel-ESP
MSR on every task switch)

-- mingo

Ingo Molnar

Dec 8, 1999, 3:00:00 AM

to Linus Torvalds

On Wed, 8 Dec 1999, Linus Torvalds wrote:

> > i believe the best solution is to not call sys_vm86() with SYSENTER. The
> > reason is to not slow down all the other system calls with an extra check
> > for VM_MASK.
>
> Well... You'll also have to cover stuff like "sigreturn" etc, which _can_
> be quite timing critical.

hm. Signals themselves can be handled easily i believe - setup_frame()
modifies not only regs->eip and regs->esp but also regs->edi and regs->ebp
[the two registers carrying the SYSEXIT return EIP and ESP]. Is there any
problem to be expected here?

sigreturn can be handled by generating a SYSENTER on the stack, instead of
the 'popl %eax ; movl $,%eax ; int $0x80' we do currently.

> And there's also the issue of non-system-calls. Things like the return
> from the page fault handler could very well be speeded up using SYSEXIT,
> by just making SYSEXIT part of the regular return sequence. Basically,
> change the "restore_all" code in arch/i386/kernel/entry.S to dynamically
> switch between "iret" and "SYSEXIT" - then _every_ normal kernel exit gets
> speeded up.

Hm, these non-glibc entry points in fact are very interesting because it
can be done without changing glibc.

> No special cases. You lose by having three comparisons: you need to check
> DS, SS and eflags. But that's on the order of a few cycles, and you win
> by using SYSEXIT whenever possible - including interrupts, page faults,
> etc.

SS is destroyed and overwritten with the kernel SS by SYSENTER - i don't
think we could get around that. Can we require user-space to save any
potential fancy extra segments before doing a fastcall?

Brian Gerst

Dec 8, 1999, 3:00:00 AM

to Ingo Molnar

Ingo Molnar wrote:
>
> On Wed, 8 Dec 1999, Brian Gerst wrote:
>
> > As far as I know, Wine currently only uses the extra segments for
> > Windows code. Any Linux syscalls should only be coming from the main
> > code segment.
>
> that is great. No changed %ds, %es, %ss either when glibc is called?

I'd have to scour the wine source to verify that.

> > I don't know about Intel's SYSENTER/SYSEXIT, but with AMD's
> > SYSCALL/SYSRET you cannot return to user space with iret after a
> > SYSCALL. The CPU sets an internal flag that causes a GPF when
> > anything modifies %cs except for SYSRET or an interrupt (hardware or
> > software). This has been a big stumbling block for me with adding
> > SYSCALL support because it interferes with task switching (ie. can't
> > switch to another task that would do an iret), unless it is restricted
> > to syscalls that cannot sleep.
>
> i've just tested this on Xeon CPUs, and it works. (i forced the fastcall
> entry point to reschedule unconditionally, the system still uses int $80)
>
> I agree that this would be a serious showstopper. Are you sure you weren't
> bitten by the stack-MSR issue? (we have to reload the SYSENTER kernel-ESP
> MSR on every task switch)

Positive. The kernel stack pointer isn't set with the K6's MSR, which
is unfortunate. Only %cs, %ss, and the kernel entry point are set.
%esp still points to the userspace stack after SYSCALL and must be
reloaded with the kernel's stack manually, which will be very tricky in
an SMP environment if/when Athlon SMP systems appear.

There are two versions of the SYSCALL support. Version 1 is on the K6
and K6-2's before stepping 8, and version 2 is later K6-2s, the K6-3,
and I presume the Athlon. My testing was done with version 1. I now
have a K6-2 with the version 2 support and I will run some tests later
tonight. As far as I know, the only difference between versions is in
how %cs and %ss are set upon SYSRET.

Documentation on SYSCALL/SYSRET can be found at
http://www.amd.com/K6/k6docs/pdf/21086.pdf

--

Brian Gerst

Linus Torvalds

Dec 8, 1999, 3:00:00 AM

to Ingo Molnar

On Wed, 8 Dec 1999, Ingo Molnar wrote:
>
> SS is destroyed and overwritten with the kernel SS by SYSENTER - i don't
> think we could get around that. Can we require user-space to save any
> potential fancy extra segments before doing a fastcall?

It's not destroyed if you just say "you can only use SYSENTER with the
standard SS and DS".

Then the SYSENTER code just builds a perfectly normal stack, EXACTLY as if
it was entered with a regular system call. Again, absolutely no special
cases. We just hardcode the return address (from the user stack), SS and
DS - exactly because they get destroyed by the instruction itself.

Then, the return path has the same stack regardless of whether it was
entered through a page fault, an old system call or a new system call. And
it always tries to return with SYSEXIT.

The problem with SYSEXIT is that it clobbers %edx and %ecx (or whatever -
I forget what the registers were and I'm too lazy to look them up), and
that makes it nasty to use for the generic case. But we'd have that user-
space fixup area..

Linus

Ingo Molnar

Dec 8, 1999, 3:00:00 AM

to Linus Torvalds

On Wed, 8 Dec 1999, Linus Torvalds wrote:

> The problem with SYSEXIT is that it clobbers %edx and %ecx (or whatever -
> I forget what the registers were and I'm too lazy to look them up), and

(yep it's %edx+%ecx for SYSEXIT)

> that makes it nasty to use for the generic case. But we'd have that user-
> space fixup area..

hm, i'm not sure whether glibc relies on int $80 not clobbering %edx and
%ecx.

-- mingo

Linus Torvalds

Dec 8, 1999, 3:00:00 AM

to Ingo Molnar

On Wed, 8 Dec 1999, Ingo Molnar wrote:
> On Wed, 8 Dec 1999, Linus Torvalds wrote:
>
> > The problem with SYSEXIT is that it clobbers %edx and %ecx (or whatever -
> > I forget what the registers were and I'm too lazy to look them up), and
>
> (yep it's %edx+%ecx for SYSEXIT)
>
> > that makes it nasty to use for the generic case. But we'd have that user-
> > space fixup area..
>
> hm, i'm not sure whether glibc relies on int $80 not clobbering %edx and
> %ecx.

Well, it's probably ok to use SYSEXIT for system calls, but I was thinking
of the non-system-call case where you _really_ cannot clobber %edx/%ecx.
Clobbering registers on page faults is considered rude ;)

Maybe we can't use SYSEXIT for non-system-calls, but it would just be so
nice if there were no special cases. A user-space trampoline that restores
%ecx/%edx after the SYSEXIT would allow that. But it may just be too
complex to really worry about.
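[Sketch, not part of the original message: one way to picture the
trampoline idea is the small C model below. SYSEXIT loads EIP from %edx
and ESP from %ecx, so the kernel would push the real %edx, %ecx and
return address onto the user stack, exit to a trampoline, and let the
trampoline pop them back. The address TRAMPOLINE_EIP, the struct layout
and both function names are assumptions made for illustration; the
stack is modelled as an array indexed by words, not real memory.]

```c
#include <stdint.h>

#define TRAMPOLINE_EIP 0xffffe800u  /* hypothetical address in the fixup area */

struct cpu {
    uint32_t eip;
    uint32_t esp;   /* modelled as a word index into the stack array */
    uint32_t ecx;
    uint32_t edx;
};

/* Kernel side: resume user mode so that it ends up with the values in *want. */
static void exit_via_sysexit(struct cpu *cpu, uint32_t *stack,
                             const struct cpu *want)
{
    uint32_t sp = want->esp;

    /* Save what SYSEXIT is about to clobber, plus the real return address. */
    stack[--sp] = want->eip;
    stack[--sp] = want->ecx;
    stack[--sp] = want->edx;

    /* SYSEXIT takes its targets from %edx (EIP) and %ecx (ESP). */
    cpu->edx = TRAMPOLINE_EIP;
    cpu->ecx = sp;
    cpu->eip = cpu->edx;
    cpu->esp = cpu->ecx;
}

/* User side: models "popl %edx ; popl %ecx ; ret" in the fixup area. */
static void run_trampoline(struct cpu *cpu, const uint32_t *stack)
{
    cpu->edx = stack[cpu->esp++];
    cpu->ecx = stack[cpu->esp++];
    cpu->eip = stack[cpu->esp++];
}
```

[The model also shows where the complexity comes from: the kernel has to
write three words to the user stack on every such exit, which for a page
fault return means touching user memory that may itself fault.]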

Linus

Ulrich Drepper

Dec 8, 1999, 3:00:00 AM

to Ingo Molnar

Ingo Molnar <mi...@chiara.csoma.elte.hu> writes:

> hm, i'm not sure whether glibc relies on int $80 not clobbering %edx and
> %ecx.

The syscall wrappers definitely rely on this.

--
---------------. drepper at gnu.org ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Cygnus Solutions `--' drepper at cygnus.com `------------------------

Ulrich Weigand

Dec 8, 1999, 3:00:00 AM

to Ingo Molnar

In linux-kernel you write:

>On Wed, 8 Dec 1999, Linus Torvalds wrote:

>> The same is true of Wine: you just need to check the DS/SS segment values
>> on return (and if they are anything but USER_DS you need to use "iret"
>> again).

>whoops, nasty. I'm not quite sure how we could do this though - SYSENTER
>destroys CS and SS, so we have nothing to check ... Probably Wine has to
>use the old int $80 interface?

Maybe I can clarify the use of non-standard segments within Wine:
While we have to load non-standard selectors into the segment registers
when executing 16-bit Windows code, the standard values are always restored
by the Wine 16<->32 bit glue code before we call any library routine or
perform a syscall. (glibc would get quite confused anyway if called with
a non-standard selector loaded in %ds, as it couldn't access its own
global data etc.)

There is just one point where we need OS assistance in managing segments:
If a signal arrives while the Wine process is currently executing 16-bit
code, we need the OS to switch the segment registers back to the standard
settings before executing the signal handler. Similarly, the sigreturn
code would then need to restore the segment registers before continuing
to execute the interrupted 16-bit code. In special cases, e.g. when the
signal triggers the Wine-internal debugger, we might even want to *change*
the selector values in the context structure passed to the signal handler,
and have the OS reload those changed values before resuming execution.
All this works fine currently.

If I understand the signal handling mechanism in Linux correctly, return
from a signal handler works by performing a sigreturn syscall. As this
syscall will need to return to 16-bit code with non-standard segment
registers, using a SYSENTER/SYSEXIT mechanism would appear problematical
in this case. All *other* syscalls should pose no problem even under Wine,
however.

[ There's one special use of segment registers by Wine even when executing
32-bit code: Windows apps require that the %fs register points to a
thread information block containing essential parameters related to the
current thread, e.g. thread ID, thread-local storage, exception handler
chain etc. For efficiency reasons we keep that value loaded into %fs
throughout the execution of Wine, even while performing glibc calls or
syscalls. As %fs is not normally used by any library, this doesn't
hurt currently.

However, even when switching to a SYSENTER/SYSEXIT syscall mechanism,
this should pose no problem, I think, as those instructions don't treat
%fs in any special way ... ]

Bye,
Ulrich

--
Ulrich Weigand,
IMMD 1, Universitaet Erlangen-Nuernberg,
Martensstr. 3, D-91058 Erlangen, Phone: +49 9131 85-7688

Frank van Maarseveen

Dec 8, 1999, 3:00:00 AM

to Linus Torvalds

On Tue, Dec 07, 1999 at 09:49:59PM -0800, Linus Torvalds wrote:
>
>
> On Wed, 8 Dec 1999, Artur Skawina wrote:
> >
> > well, while trying to get it to work i've often wondered what the
> > person who came up with the sysenter/sysexit scheme at intel had been
> > sniffing...
>
> I agree. I don't understand why they couldn't just have saved the old
> eip/esp in another msr or something.

Isn't it possible to use the lower (shifted) eip bits as the system call
number when the magic page trick is in place? If so, that could explain it.
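[Sketch, not part of the original message: one possible reading of this
suggestion is that each system call gets its own fixed-size stub in the
magic page, so the stub's offset within the page identifies the call.
The base address, the 16-byte stub size and the function name below are
all assumptions chosen for illustration.]

```c
#include <stdint.h>

#define MAGIC_PAGE_BASE 0xffffe000u  /* hypothetical fixed mapping address */
#define MAGIC_PAGE_SIZE 4096u
#define STUB_SHIFT      4            /* assumed 16-byte stub per system call */

/*
 * Recover a system call number from an address inside the magic page.
 * Returns -1 when the address does not lie in the page at all.
 */
static int syscall_nr_from_eip(uint32_t eip)
{
    uint32_t offset = eip - MAGIC_PAGE_BASE;  /* wraps for eip below the base */

    if (offset >= MAGIC_PAGE_SIZE)
        return -1;
    return (int)(offset >> STUB_SHIFT);
}
```

[With these assumed numbers one page holds 256 stubs, and any address
inside a stub maps to the same call number.]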

--
Frank

Artur Skawina

Dec 9, 1999, 3:00:00 AM

to Linus Torvalds

[note that all i really wanted was to point out one of the consequences
of introducing syscalls with six args, which 2.1.31pre1 did. The
SYSENTER stuff is not as simple as it looks at first; i intended, and
will, post the code, but i want to take care of all the issues i
already know about first (otherwise it's fully functional, i can eg
run benchmarks, compile kernels etc already). I'll try doing that
~ the weekend.]

Linus Torvalds wrote:
>
> > > sysenter is really a strange thing. Have you verified that your current
> > > code works with vm86 mode programs like dosemu or with wine, for example?
> >
> > No, it does not. I haven't even written the code handling that case, for
> > the very simple reason i haven't yet figured out _what_ to do when a
> > vm86 task executes SYSENTER.
>
> Oh, you can't do that. SYSENTER just loses all the information, so you
> don't really have much choice: you can't get it right as far as I can
> tell.
>
> I was more thinking about the case where SYSENTER was used for the vm86()
> system call, and you'd have to be careful _not_ to use SYSEXIT because you
> can't use SYSEXIT to return to vm86 mode - you have to use the same old
> "iret" for that case. That shouldn't be too hard: just check the flags on
> the return path and if the flags imply a return to VM86 you just use the
> old path.
>
> The same is true of Wine: you just need to check the DS/SS segment values
> on return (and if they are anything but USER_DS you need to use "iret"
> again).
>
> You may have all this code already, I was just wondering. It doesn't look
> like rocket science, but it _does_ look like there's just a lot of details
> that need to be just right...

Yep. and all the cases i've so far identified are handled (although
i'm assuming ds is std on entry for non-vm86 tasks (note the ss case
you don't even get to check)).
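[Sketch, not part of the original message: the return-path checks
discussed here (fall back to iret when returning to vm86 mode or when
DS/SS are not the standard user selector, use SYSEXIT otherwise) might
be modelled in C roughly as follows. The struct, the function name and
the selector value 0x2b are assumptions for illustration and are not
taken from the patch under discussion.]

```c
#include <stdint.h>

#define VM_MASK 0x00020000u  /* EFLAGS.VM (bit 17): task runs in vm86 mode */
#define USER_DS 0x2bu        /* assumed flat user data selector */

/* The subset of the saved user register frame that the check looks at. */
struct saved_regs {
    uint32_t eflags;
    uint16_t ds;
    uint16_t ss;
};

enum exit_path { EXIT_IRET, EXIT_SYSEXIT };

static enum exit_path choose_exit_path(const struct saved_regs *regs)
{
    /* SYSEXIT cannot return to vm86 mode, so keep the old path there. */
    if (regs->eflags & VM_MASK)
        return EXIT_IRET;

    /* Non-standard segments (e.g. Wine running 16-bit code) need iret. */
    if (regs->ds != USER_DS || regs->ss != USER_DS)
        return EXIT_IRET;

    return EXIT_SYSEXIT;
}
```

[These are the "three comparisons" mentioned earlier in the thread; the
SS check only makes sense on a frame that was built by something other
than SYSENTER, since SYSENTER itself discards the user SS.]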

> > This is one of only two unresolved issues
> > left. As it is impossible to recover from this (w/o a cooperative
> > userspace, that is. sysenter drops eip/esp/cs/ss) what should happen?
>
> The thing is, you don't even have enough information to know whether the
> user was co-operative or not. A "bad" use of SYSENTER will look exactly
> like a good one, so you really cannot tell. As such, I wouldn't worry
> about the issue, because there is nothing you can possibly do about it
> anyway.

exactly.

[there is something you could do -- not make SYSENTER available from vm86,
ie clear the msr while switching to a vm86 task, and reset it otherwise.
I don't like the cost though, as it likely means two additional conditional
branches and an extra msr write in switchto()]
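[Sketch, not part of the original message: the task-switch bookkeeping
described here could be modelled as below. The MSRs are simulated with a
plain struct because writing real MSRs needs ring 0; a zero selector in
the SYSENTER CS MSR makes SYSENTER fault, which is what "clearing the
msr" relies on. The struct names, the function name and the KERNEL_CS
value are assumptions for illustration.]

```c
#include <stdbool.h>
#include <stdint.h>

#define KERNEL_CS 0x10u  /* assumed kernel code selector */

struct task {
    uint32_t kernel_esp;  /* top of this task's kernel stack */
    bool     is_vm86;     /* task runs vm86 code (e.g. dosemu) */
};

/* Stand-in for the SYSENTER CS and ESP machine-specific registers. */
struct sysenter_msrs {
    uint32_t cs;
    uint32_t esp;
};

/* Called on every task switch; returns how many MSR writes were needed. */
static int update_sysenter_msrs(struct sysenter_msrs *msr,
                                const struct task *next)
{
    int writes = 0;
    uint32_t want_cs = next->is_vm86 ? 0u : KERNEL_CS;

    /* The kernel stack pointer always has to follow the running task. */
    msr->esp = next->kernel_esp;
    writes++;

    /* Only touch the CS MSR when crossing a vm86/non-vm86 boundary. */
    if (msr->cs != want_cs) {
        msr->cs = want_cs;
        writes++;
    }
    return writes;
}
```

[The extra write only happens when switching between a vm86 task and a
normal one, but the comparison itself sits on every switch, which is the
cost being objected to above.]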

[in fact what i'm doing right now is trying to put as many cases as possible
back on the sysexit path -- it _will_ make it a few cycles slower, but
there are other reasons i need the extra code for anyway]

> > [i don't use any vm86 tasks so i'd find it acceptable to simply _exit(),
> > but there might be a better way... Ideas?]
>
> How would you know to _exit()? You'll just have to think that it's a real
> system call, and then on the return path the process will probably receive
> a SIGSEGV because the stack is crap etc.. But that's ok - just another bad
> pointer dereference..

ok, that's simple enough. But do we really want linux syscalls executed
from inside wine, dosemu etc?

[i'll ignore vmware for now, but there might be some interesting
possibilities there... :^) ]

> > Oh, there's something else i have on the todo list, but haven't yet
> > looked into -- SYSENTER doesn't clear TF.
>
> Does it matter?

As i said, finding out is on my todo list. [iirc it's not fatal, but i
did not investigate this fully, it could make a difference, and even
if it does, should be easy to fix]

Brian Gerst

Dec 9, 1999, 3:00:00 AM

to Ingo Molnar

I just tested iret after syscall on my K6-2, and it does work. It looks
like the old version of SYSCALL isn't worth supporting then, as it has
other problems with the return to user mode as well. I'm working on an
equivalent of your SYSENTER patch. One concern I have however is that
currently entry.S is a mess of spaghetti code (for good reason
unfortunately), and all these alternate syscall entries are making it
much worse.

--

Brian Gerst

Ingo Molnar

Dec 9, 1999, 3:00:00 AM

to Brian Gerst

On Wed, 8 Dec 1999, Brian Gerst wrote:

> I just tested iret after syscall on my K6-2, and it does work. It
> looks like the old version of SYSCALL isn't worth supporting then, as
> it has other problems with the return to user mode as well. I'm
> working on an equivalent of your SYSENTER patch. One concern I have
> however is that currently entry.S is a mess of spaghetti code (for
> good reason unfortunately), and all these alternate syscall entries
> are making it much worse.

i suspect the AMD SYSCALL/SYSRET instructions were changed to be
completely compatible with Intel's SYSENTER/SYSEXIT instructions? In that
case the only thing to do is to initialize the STAR MSRs in the
init_fastcall() function in my patch in the AMD case. The rest can be
shared.

-- mingo

Theodore Y. Ts'o

Dec 9, 1999, 3:00:00 AM

to Linus Torvalds

> Date: Tue, 7 Dec 1999 21:49:59 -0800 (PST)
> From: Linus Torvalds <torv...@transmeta.com>
>
> Because libc will do it every single time a process gets started.
>
> I'm a latency person. I've tried to talk to people about pre-linking libc
> at a fixed address, and avoid all the horrible run-time linking for the
> case when the pre-linked address is available. So far nobody has done
> this on Linux, and it makes our process startup slower.

DEC OSF/1 does this. Libraries have pre-assigned address ranges, which
are configured on a per-machine basis using a file in /etc. (Obviously,
the system gets shipped with a default address range which most machines
use, but it's not a requirement in the architecture).

For each process, when a shared library is loaded, if the default
address range is available, it gets used, and the text pages get mapped
in and shared, and there's no need to do any run-time linking. If for
some reason that default address range is not available, then new memory
pages have to get allocated, and then library has to be relocated for
that new location. Of course, there are nice tools available for
automatically building the library VM address assignment file based on
what libraries are available on that machine, ala ldconfig.
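[Sketch, not part of the original message: the load-time decision
described above comes down to an overlap test between the library's
pre-assigned range and whatever is already mapped in the process. The
types and function names below are assumptions for illustration and do
not reflect the actual OSF/1 loader.]

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open address range [start, end). */
struct range {
    uintptr_t start;
    uintptr_t end;
};

static bool ranges_overlap(struct range a, struct range b)
{
    return a.start < b.end && b.start < a.end;
}

/*
 * True when the library can be mapped at its pre-assigned range, so its
 * text pages can be shared as-is and no run-time relocation is needed.
 * False means the loader has to pick another address and relocate.
 */
static bool preferred_range_free(struct range preferred,
                                 const struct range *mapped, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (ranges_overlap(preferred, mapped[i]))
            return false;
    }
    return true;
}
```

[The fast path is the common one: on a machine whose address assignment
file is consistent, every library finds its range free and the check is
the only work done.]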

It also means that you don't have to pay the cost of using PIC code,
and still get the benefits of using shared libraries. After the pain of
using globally allocated shared library address ranges for a.out shared
libraries, I think some folks came to the conclusion that pre-allocated
address assignment was automatically a bad thing. But using locally
pre-allocated address assignment, ala OSF/1, I think is a good idea, and
something that we might want to consider.

- Ted

M.J. Galan

Dec 9, 1999, 3:00:00 AM

to linux-...@vger.rutgers.edu

I always found this approach to be sensible and sound...
Best of Both Worlds.
