Getting rid of %fs

Michael Elizabeth Chastain

Jul 4, 1996, 3:00:00 AM

I was reading 'Life after 2.0' by Linus Torvalds, available here:

http://linux.ucs.indiana.edu/hypermail/linux/kernel/9606/0765.html

One of the ideas here is to get rid of %fs as the separate segment
descriptor for user space; so that, for example, memcpy and
memcpy_touser will share the same implementation (although conceptually
of course they will be very different and code should still make the
distinction).

I think there's a reliability and security risk here. Any part of the
kernel that does a memcpy_touser can now also write on kernel data space
if the surrounding code fails to validate its pointer correctly. Right
now, this can corrupt the calling process or cause an oops, but it can't
ever kill the kernel. If memcpy_fromuser and memcpy_touser are changed
to have physical access to kernel data space, then such code can corrupt
the kernel, or even become an access point for user programs to write
over controlled sections of kernel data space.

I personally think this is sufficient reason to stay with the existing
protection system; but whatever happens with this part of the VM system,
I hope the designers take this risk into account.

Michael Chastain
m...@duracef.shout.net

Linus Torvalds

Jul 5, 1996, 3:00:00 AM

In article <4rfh9j$f...@treflan.shout.net>,

Michael Elizabeth Chastain <m...@treflan.shout.net> wrote:
>I was reading 'Life after 2.0' by Linus Torvalds, available here:
>
> http://linux.ucs.indiana.edu/hypermail/linux/kernel/9606/0765.html
>
>One of the ideas here is to get rid of %fs as the separate segment
>descriptor for user space; so that, for example, memcpy and
>memcpy_touser will share the same implementation (although conceptually
>of course they will be very different and code should still make the
>distinction).
>
>I think there's a reliability and security risk here. Any part of the
>kernel that does a memcpy_touser can now also write on kernel data space
>if the surrounding code fails to validate its pointer correctly. Right
>now, this can corrupt the calling process or cause an oops, but it can't
>ever kill the kernel.

Indeed. That was one of the reasons to use a special segment for user
code originally: the segments made it possible to do some of the limit
checking in hardware..

However, this is much less of an issue these days, because for other
reasons (much more complete memory management), we always have to
validate any user mode pointers _anyway_ with verify_area() (in the
original Linux setup you didn't need to verify pointers at all, and the
hardware took care of it and trapped, but that will result in some very
bad behaviour when the traps happen in critical places).
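
As a rough illustration of the software check described above, here is a user-space C sketch that models a range check performed before a copy. It is not kernel code: model_arena, MODEL_USER_LIMIT, model_verify_area() and model_copy_to_user() are hypothetical names invented for this example, and the real verify_area() is assumed to do more than this simple bound check.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model: one flat arena. Offsets below MODEL_USER_LIMIT
 * stand in for user space, the rest stands in for kernel data. */
#define MODEL_ARENA_SIZE 8192u
#define MODEL_USER_LIMIT 4096u

static unsigned char model_arena[MODEL_ARENA_SIZE];

/* Software range check, playing the role the post gives to verify_area().
 * Written so that addr + size can never overflow. */
static int model_verify_area(size_t addr, size_t size)
{
    if (addr > MODEL_USER_LIMIT || size > MODEL_USER_LIMIT - addr)
        return -EFAULT;
    return 0;
}

/* Copy into the "user" part of the arena, but only after the check. */
static int model_copy_to_user(size_t to, const void *from, size_t n)
{
    int err = model_verify_area(to, n);

    if (err)
        return err;
    assert(to + n <= MODEL_USER_LIMIT); /* guaranteed by the check above */
    memcpy(&model_arena[to], from, n);
    return 0;
}
```

In this model, a copy that would run past MODEL_USER_LIMIT returns -EFAULT and leaves the "kernel" half of the arena untouched. Without the check, the same memcpy() would silently overwrite it, which is the risk discussed in this thread.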

So yes, getting rid of %fs makes it more likely that buggy kernel code
can blow up, but I have personally always felt that you shouldn't cater
overmuch to buggy kernel code because (a) - it shouldn't exist, and (b)
buggy code is buggy, and it's quite as likely to corrupt the kernel some
other way anyway.

Getting rid of %fs can potentially be a quite noticeable performance
advantage, which is why I think it will pay off. We lose some hardware
checking (that shouldn't be needed anyway), but we gain the possibility
of having a lot faster user-mode accesses from within the kernel. For
performance, using segmentation for user mode accesses means:

- the segment override adds one byte to the instruction (negligible in
  itself)
- a segment override adds one cycle to the instruction _and_ makes the
  pipeline behave badly. This is especially noticeable for small
  memory copies (copying a structure to/from user space), which is
  actually done a lot in various system calls.
- kernel entry is slower with %fs because we have to set it up (and
  loading a segment register is not only slow, it's also serializing,
  ugh).
- We need to duplicate x86-specific code, and the user mode versions of
  the code can't be streamlined as well. For an example of this, just
  look at the memcpy_and_csum[_fromfs]() stuff to see what I mean.
- because we have to use inline assembly for the low-level operation,
  we can't get gcc to optimize any of the accesses.

I agree with your concerns, but I essentially think that we want to get
rid of %fs regardless.

I actually have a second "hidden agenda" with this all: this change will
make Linux/x86 look a whole lot more like the other Linux/xxx ports when
it comes to hardware accesses. There are lots of device drivers which
just assume that you can use hardware shared memory pointers directly
for kernel access, when you should really do a "memcpy_[to|from]io()" or
a read[bwl]()/write[bwl]() for portability reasons. Getting rid of %fs
will also force us to relocate the kernel virtually and thus make sure
that people write more portable code even if they don't have a PowerPC
or alpha machine..

See, you not only have to be a good coder to create a system like Linux,
you have to be a sneaky bastard too ;-)

Linus

Andreas Schwab

Jul 5, 1996, 3:00:00 AM

In article <4rfh9j$f...@treflan.shout.net>, m...@treflan.shout.net (Michael Elizabeth Chastain) writes:

|> I was reading 'Life after 2.0' by Linus Torvalds, available here:
|> http://linux.ucs.indiana.edu/hypermail/linux/kernel/9606/0765.html

|> One of the ideas here is to get rid of %fs as the separate segment
|> descriptor for user space; so that, for example, memcpy and
|> memcpy_touser will share the same implementation (although conceptually
|> of course they will be very different and code should still make the
|> distinction).

Don't do this! On other CPU's, like the m68k, the kernel address
space is completely separate from the user address space (they are
defined by separate mmu translation trees), so it would be impossible
to follow this change.
--
Andreas Schwab "And now for something
sch...@issan.informatik.uni-dortmund.de completely different"

Linus Torvalds

Jul 6, 1996, 3:00:00 AM

In article <vyzenmr...@lamothe.informatik.uni-dortmund.de>,

Andreas Schwab <sch...@issan.informatik.uni-dortmund.de> wrote:
>In article <4rfh9j$f...@treflan.shout.net>, m...@treflan.shout.net (Michael Elizabeth Chastain) writes:
>
>|> I was reading 'Life after 2.0' by Linus Torvalds, available here:
>|> http://linux.ucs.indiana.edu/hypermail/linux/kernel/9606/0765.html
>
>|> One of the ideas here is to get rid of %fs as the separate segment
>|> descriptor for user space; so that, for example, memcpy and
>|> memcpy_touser will share the same implementation (although conceptually
>|> of course they will be very different and code should still make the
>|> distinction).
>
>Don't do this! On other CPU's, like the m68k, the kernel address
>space is completely separate from the user address space (they are
>defined by separate mmu translation trees), so it would be impossible
>to follow this change.

Don't worry - I'm not getting rid of the "copy_to_user()" functions. I'm
just changing the implementation of that function on the x86, because on
the x86 we can do it more efficiently without separate segments.

In short, those architectures that require separate segments (ironic to
think of an x86 as less segmented than a 68k, but true) will not be
bothered. As Michael wrote (and it should be very clear in my "Life
after 2.0" too), the _concept_ of a separate user space segment is still
there, it's just that the x86 doesn't actually use it (the same way the
alpha, sparc and PowerPC ports do not have any separate user segment,
even though the code uses "put_user()" and "get_user()" to access user
space).
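
The idea of one interface with different per-architecture implementations can be sketched in user-space C as follows. This is only a model under stated assumptions: flat_space, user_only_space, flat_copy_to_user(), separate_copy_to_user() and report_status() are hypothetical names, and neither backend is the actual code of any port.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_SPACE_SIZE 256u

/* Flat model: kernel and user data share one address space. */
static unsigned char flat_space[MODEL_SPACE_SIZE];
/* Separate model: user data lives in its own space with its own address 0. */
static unsigned char user_only_space[MODEL_SPACE_SIZE];

/* The one interface that callers are written against. */
typedef int (*model_copy_to_user_fn)(size_t to, const void *from, size_t n);

static int model_range_ok(size_t to, size_t n)
{
    return to <= MODEL_SPACE_SIZE && n <= MODEL_SPACE_SIZE - to;
}

/* Trivial backend: an ordinary memcpy() after the range check. */
static int flat_copy_to_user(size_t to, const void *from, size_t n)
{
    if (!model_range_ok(to, n))
        return -1;
    memcpy(&flat_space[to], from, n);
    return 0;
}

/* Stand-in for a special "store to user space" primitive. */
static void store_user_byte(size_t uaddr, unsigned char value)
{
    assert(uaddr < MODEL_SPACE_SIZE);
    user_only_space[uaddr] = value;
}

/* Non-trivial backend: every byte goes through the special primitive. */
static int separate_copy_to_user(size_t to, const void *from, size_t n)
{
    const unsigned char *src = from;
    size_t i;

    if (!model_range_ok(to, n))
        return -1;
    for (i = 0; i < n; i++)
        store_user_byte(to + i, src[i]);
    return 0;
}

/* Caller code: identical no matter which backend is plugged in. */
static int report_status(model_copy_to_user_fn copy, size_t user_buf)
{
    static const char status[] = "ok";

    return copy(user_buf, status, sizeof status);
}
```

Calling report_status() with either backend gives the same result from the caller's point of view; only the storage behind the interface differs.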

Linus

Michael Elizabeth Chastain

Jul 6, 1996, 3:00:00 AM

In article <vyzenmr...@lamothe.informatik.uni-dortmund.de>,
Andreas Schwab <sch...@issan.informatik.uni-dortmund.de> wrote:
> Don't do this! On other CPU's, like the m68k, the kernel address
> space is completely separate from the user address space (they are
> defined by separate mmu translation trees), so it would be impossible
> to follow this change.

As I understand the proposal, kernel code will still call memcpy_fromfs
and memcpy_tofs. On i386, these functions will have trivial
implementations, similar to the current Sparc implementation. On m68k,
these functions will continue to have non-trivial implementations.

There's a risk that people working on the i386 port will neglect to call
the right functions and their code will work on the i386 but break on
any other port.
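
A small user-space C model can make this failure mode concrete. Everything here is hypothetical (struct model_machine, model_get_user(), buggy_direct_read()); it only assumes that a plain dereference and the user accessor may or may not reach the same memory, depending on the port.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MEM_SIZE 64u

/* A "machine" is described by what each kind of access reaches. */
struct model_machine {
    unsigned char *kernel_view; /* reached by a plain pointer dereference */
    unsigned char *user_view;   /* reached through the user accessors */
};

/* Correct: always goes through the user accessor. */
static unsigned char model_get_user(const struct model_machine *m, size_t uaddr)
{
    assert(uaddr < MODEL_MEM_SIZE);
    return m->user_view[uaddr];
}

/* Buggy: treats the user address as if it were a kernel pointer. */
static unsigned char buggy_direct_read(const struct model_machine *m, size_t uaddr)
{
    assert(uaddr < MODEL_MEM_SIZE);
    return m->kernel_view[uaddr];
}
```

On a machine where both views point at the same memory, the two functions return the same byte, so the bug goes unnoticed. Where the views differ, buggy_direct_read() returns unrelated kernel data.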

We can address this by documentation, either through the Documentation/
directory or a section in the Kernel Hacker's Guide. I volunteer to
write some. Suggestions about where such documentation belongs and what
should be in it welcomed.

Michael Chastain
m...@duracef.shout.net

Andreas Schwab

Jul 8, 1996, 3:00:00 AM

In article <4rlhp3$c...@linux.cs.Helsinki.FI>, torv...@linux.cs.Helsinki.FI (Linus Torvalds) writes:

|> In article <vyzenmr...@lamothe.informatik.uni-dortmund.de>,
|> Andreas Schwab <sch...@issan.informatik.uni-dortmund.de> wrote:

|>> In article <4rfh9j$f...@treflan.shout.net>, m...@treflan.shout.net (Michael Elizabeth Chastain) writes:
|>>
|>> |> I was reading 'Life after 2.0' by Linus Torvalds, available here:
|>> |> http://linux.ucs.indiana.edu/hypermail/linux/kernel/9606/0765.html
|>>
|>> |> One of the ideas here is to get rid of %fs as the separate segment
|>> |> descriptor for user space; so that, for example, memcpy and
|>> |> memcpy_touser will share the same implementation (although conceptually
|>> |> of course they will be very different and code should still make the
|>> |> distinction).
|>>

|>> Don't do this! On other CPU's, like the m68k, the kernel address
|>> space is completely separate from the user address space (they are
|>> defined by separate mmu translation trees), so it would be impossible
|>> to follow this change.

|> Don't worry - I'm not getting rid of the "copy_to_user()" functions. I'm
|> just changing the implementation of that function on the x86, because on
|> the x86 we can do it more efficiently without separate segments.

Thanks, that should be ok. Actually it would be possible to do a
similar change on the m68k (by making part of the user address space
accessible in supervisor mode only), but that would only work on the
68020 and 68030, not on 68040+ which only have a crippled mmu.

|> In short, those architectures that require separate segments (ironic to
|> think of an x86 as less segmented than a 68k, but true) will not be

"Segments" is not the right term, there are just separate address
spaces (user mode vs. supervisor mode), both covering the full 32 bit
range. You could even make separate translations for code and data
(on '020 and '030), for a total of 16GB address space (hey, can the
i386 do this? :-) )

Andreas.

Michael K. Johnson

Jul 8, 1996, 3:00:00 AM

On 6 Jul 1996 16:32:22 GMT, Michael Elizabeth Chastain <m...@treflan.shout.net> wrote:
>There's a risk that people working on the i386 port will neglect to call
>the right functions and their code will work on the i386 but break on
>any other port.
>
>We can address this by documentation, either through the Documentation/
>directory or a section in the Kernel Hacker's Guide. I volunteer to
>write some. Suggestions about where such documentation belongs and what
>should be in it welcomed.

I definitely think that it ought to be documented in the Kernel Hackers'
Guide, wherever else it is documented. People appear to be actually
reading it (over 8000 unique sites have accessed it since the web site
was started 2.5 months ago) and so some people might actually pay
attention to it. I've also heard queries from people on exactly how
they are supposed to use all the different memory calls--not just
the *user* functions, but also the bus/phys/virt functions. Write
the documentation and I'll be glad to put it up in the KHG!

Thanks!

michaelkjohnson

"Ever wonder why the SAME PEOPLE make up ALL the conspiracy theories?"

Marc Lefranc

Jul 15, 1996, 3:00:00 AM

In article <slrn4u2ov4....@redhat.com> john...@redhat.com
(Michael K. Johnson) writes:

While we are at it, is there a way to get the PHYSICAL address of a
buffer in user space from driver code (such as the buf argument of a
read() function)? I am asking this because I am trying to write a
driver for a PCI data acquisition board. This board supports PCI bus
mastering; for this I just have to give a count and a physical
address to the PCI controller on the board. I know I can copy to
kernel memory space and then do a memcpy_tofs. However, since the
maximum rate is 20 Mbytes/s, I would rather copy the data directly to
user space. I once came across a virttophys() function, but this seems
to be intended for kernel space only (i.e. it just does a 1:1
mapping).
Thanks in advance for any pointer.

P.S. I am aware of the fragmentation problem. Buffers in user space
would be mlocked. Fortunately, transfers have to be done by halves of
the board's FIFO, which is just the size of a page (4 KB).
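
The user-space side of the setup described in the P.S. (one page-sized, mlocked buffer) might look like the following C sketch. It does not answer the question of how a driver obtains the physical address, which is left open here. spans_single_page(), alloc_locked_page() and free_locked_page() are hypothetical helper names; posix_memalign(), mlock(), munlock() and sysconf() are the standard POSIX calls.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Return 1 if [addr, addr + len) lies entirely within one page, else 0. */
static int spans_single_page(uintptr_t addr, size_t len, size_t page_size)
{
    if (page_size == 0 || len == 0 || len > page_size)
        return 0;
    return (size_t)(addr % page_size) + len <= page_size;
}

/* Allocate one page-aligned, page-sized buffer and lock it in RAM.
 * Returns 0 on success and -1 on failure (for example when the
 * locked-memory resource limit is too low for mlock() to succeed). */
static int alloc_locked_page(void **out, size_t *page_size_out)
{
    long ps = sysconf(_SC_PAGESIZE);
    void *buf = NULL;

    if (ps <= 0)
        return -1;
    if (posix_memalign(&buf, (size_t)ps, (size_t)ps) != 0)
        return -1;
    if (mlock(buf, (size_t)ps) != 0) {
        free(buf);
        return -1;
    }
    /* A page-aligned, page-sized buffer cannot straddle two pages. */
    assert(spans_single_page((uintptr_t)buf, (size_t)ps, (size_t)ps));
    *out = buf;
    *page_size_out = (size_t)ps;
    return 0;
}

/* Unlock and release a buffer obtained from alloc_locked_page(). */
static void free_locked_page(void *buf, size_t page_size)
{
    munlock(buf, page_size);
    free(buf);
}
```

The alignment matters because a 4 KB buffer that starts in the middle of a page covers two pages, which need not be physically contiguous.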