Native code 'sandboxes'

dmitry.z...@gmail.com

Mar 9, 2009, 9:18:54 AM

to Phantom OS

It is obvious that Phantom will need some way to bring native code into
the system. For example, heavy number crunching will be more efficient
in native code than in the VM, at least for some time. I intended to
implement it like this: create a special class whose objects contain
the native code itself and can run it in a special environment. On
32-bit Intel this would be a 32-bit code segment restricted to the
object's contents, plus 2-4 data segments that are objects of type
.internal.binary and are mapped through the DS+SS, ES, FS and GS
segments. This way the code can address them directly and still
cannot harm the rest of the OS.
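
For illustration only, here is a minimal sketch of how such a restricted
32-bit segment could be described to the CPU. It packs a base, a limit and
an access byte into the standard 8-byte x86 segment descriptor layout. The
name `make_descriptor` and the constants are made up for this example and
are not Phantom code. With byte granularity the 20-bit limit caps a segment
at 1 MB, so larger objects would need the 4 KB granularity flag.

```c
#include <stdint.h>

/* Access bytes: Present, DPL 3, non-system segment. */
#define ACCESS_CODE_PL3 0xFAu  /* code, execute/read */
#define ACCESS_DATA_PL3 0xF2u  /* data, read/write   */

/* Flags nibble (descriptor bits 52..55). */
#define FLAG_32BIT      0x4u   /* D/B = 1: 32-bit segment, byte-granular limit */
#define FLAG_GRAN_4K    0x8u   /* G = 1: limit counted in 4 KB units */

/* Pack base, limit (20 bits), access byte and flags into an x86
   GDT/LDT segment descriptor. The limit is "size - 1". */
static inline uint64_t make_descriptor(uint32_t base, uint32_t limit,
                                       uint8_t access, uint8_t flags)
{
    uint64_t d = 0;
    d |= (uint64_t)(limit & 0xFFFFu);              /* limit[15:0]  */
    d |= (uint64_t)(base & 0xFFFFFFu) << 16;       /* base[23:0]   */
    d |= (uint64_t)access << 40;                   /* access byte  */
    d |= (uint64_t)((limit >> 16) & 0xFu) << 48;   /* limit[19:16] */
    d |= (uint64_t)(flags & 0xFu) << 52;           /* flags        */
    d |= (uint64_t)(base >> 24) << 56;             /* base[31:24]  */
    return d;
}
```

A PL3 code segment covering a 4 KB object at the hypothetical address
0x12345678 would be `make_descriptor(0x12345678, 0xFFF, ACCESS_CODE_PL3,
FLAG_32BIT)`. Any access beyond offset 0xFFF through that segment faults.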

The question is what to do in the AMD64 world. Its 64-bit mode does not
support segments anymore, and I don't want to return to address map
switching, because it is slow and the rest of Phantom does not need it.

One possible solution is to devote one or a few CPUs to native code
execution and reload CR3 only on those CPUs, while the rest of the
CPUs work in the usual Phantom environment and switch processes
without a CR3 reload.

It is worth mentioning that the long-term (slow) GC code may live in
a special address space as well, so we may need support for separate
address spaces anyway.

tomato

Mar 9, 2009, 12:11:50 PM

to Phantom OS

On Mar 9, 6:18 am, "dmitry.zavalis...@gmail.com" <dmitry.zavalis...@gmail.com> wrote:
> It is obvious that Phantom will need some way to bring native code into
> the system. For example, heavy number crunching will be more efficient
> in native code than in the VM, at least for some time. I intended to
> implement it like this: create a special class whose objects contain
> the native code itself and can run it in a special environment. On
> 32-bit Intel this would be a 32-bit code segment restricted to the
> object's contents, plus 2-4 data segments that are objects of type
> .internal.binary and are mapped through the DS+SS, ES, FS and GS
> segments. This way the code can address them directly and still
> cannot harm the rest of the OS.

Hi Dmitry,

I would strongly advise against native code support in 32-bit land.
32 bits is on the way out -- even in the embedded space -- and the
security considerations that result could plague Phantom for years,
and distract it from its main objectives. If people care enough about
performance to write native code, then they should just buy a 64-bit
CPU. In particular, many operations that can wreak havoc are legal
even at Privilege Level 3, e.g. INT and (depending on CR4 settings)
RDTSC or RDPMC.

>
> The question is what to do in the AMD64 world. Its 64-bit mode does not
> support segments anymore, and I don't want to return to address map
> switching, because it is slow and the rest of Phantom does not need it.
>
> One possible solution is to devote one or a few CPUs to native code
> execution and reload CR3 only on those CPUs, while the rest of the
> CPUs work in the usual Phantom environment and switch processes
> without a CR3 reload.

I would keep the OS design as plastic as possible; I would not
recommend dedicating cores for native code execution. Nor would I
recommend restricting the cores which may accept hardware interrupts.

Here's what I'd suggest, in the interest of maximum simplicity and
security:

1. By default, on each core, mark all pages as Global (Page Table
Entry bit 8 set) and Supervisor (User/Supervisor bit 2 cleared).

2. When native code wants to execute, it must tell the kernel the base
and size of (a) its code and (b) its data area. These values must be
page-granular (4KB or 2MB or whatever).

3. The kernel will then mark regions 2a and 2b as NOT Global (bit 8
cleared) and NOT Supervisor (bit 2 set).

4. The native code executes at PL3, from the base of region 2a. If it
dares to access a Supervisor page, it will be Page Faulted to hell.
(Note that "Supervisor" includes the userland pages of all other
objects.)

5. An interrupt arrives. The CPU context-switches back into the
kernel, and as a result, must reload CR3. HOWEVER, because most pages
are Global, there will be a minimal impact on the contents of the
Translation Lookaside Buffer. The task switching penalty will thus be
lessened.
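
As a rough illustration of steps 1, 3 and 4, here is a minimal sketch of the
bit manipulation involved. On x86 the User/Supervisor flag is PTE bit 2
(0 = supervisor only, 1 = accessible at PL3) and the Global flag is bit 8
(honoured only when CR4.PGE is enabled). The type `pte_t` and the function
names are made up for this example. A real kernel would also have to set
the User bit in the upper-level entries on the path to the page and flush
the affected TLB entries.

```c
#include <stdint.h>

typedef uint64_t pte_t;

#define PTE_PRESENT  (1ULL << 0)  /* page is mapped */
#define PTE_WRITABLE (1ULL << 1)  /* writes allowed */
#define PTE_USER     (1ULL << 2)  /* 1 = PL3 may access, 0 = supervisor only */
#define PTE_GLOBAL   (1ULL << 8)  /* TLB entry survives a CR3 reload */

/* Step 1: default state of every page, Global and Supervisor. */
static inline pte_t pte_make_default(pte_t pte)
{
    return (pte | PTE_GLOBAL) & ~PTE_USER;
}

/* Step 3: hand a page to native code, NOT Global and NOT Supervisor. */
static inline pte_t pte_expose_to_native(pte_t pte)
{
    return (pte & ~PTE_GLOBAL) | PTE_USER;
}

/* Step 4: would an access from PL3 to this page fault? */
static inline int pte_faults_at_pl3(pte_t pte)
{
    return !(pte & PTE_PRESENT) || !(pte & PTE_USER);
}
```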

Now, let's say you have 100 native code-data sections running
concurrently. No problem. Each of the 100 has its own address map,
in accordance with the design above. However, since most levels of
the page directories are identical, you can reuse most of them, i.e.
you don't need 100 complete, parallel copies of all PTEs. Note that
under this scheme, Native Process A cannot access the pages of Native
Process B, and vice-versa.
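
To make the table sharing concrete, here is a toy two-level model. Real
x86-64 paging has four levels, so the private part would be one path of
four tables rather than a single leaf. The types `top` and `leaf` and the
function `map_clone_for_native` are hypothetical, and malloc() stands in
for a kernel page allocator.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ENTRIES 512u

struct leaf { uint64_t pte[ENTRIES]; };      /* lowest level: the PTEs */
struct top  { struct leaf *leaf[ENTRIES]; }; /* top level: points to leaves */

/* Build a private address map for one native process. Every leaf table
   is shared with `kernel_map` by pointer, except the one at `private_idx`,
   which is copied so its PTEs can be changed for this process only.
   Returns NULL on a bad index or allocation failure. */
static inline struct top *map_clone_for_native(const struct top *kernel_map,
                                               unsigned private_idx)
{
    if (private_idx >= ENTRIES)
        return NULL;

    struct top *t = malloc(sizeof *t);
    struct leaf *own = malloc(sizeof *own);
    if (t == NULL || own == NULL) {
        free(t);
        free(own);
        return NULL;
    }

    memcpy(t, kernel_map, sizeof *t);        /* share all leaves */
    if (kernel_map->leaf[private_idx] != NULL)
        memcpy(own, kernel_map->leaf[private_idx], sizeof *own);
    else
        memset(own, 0, sizeof *own);
    t->leaf[private_idx] = own;              /* private copy of one leaf */
    return t;
}
```

In this model each extra native process costs one top-level table and one
leaf, not a full copy of the tree.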

Perhaps you should provide for 2 data areas, one of which is
sharable with other objects. This would be necessary for SMP. One
could also imagine 2 code areas, one for shared code, and one for JIT
code. Anyhow, the difference is trivial. Frankly, you could allow
for X code regions and Y data regions.

Critically, the kernel must verify that the native code and data areas
lie within the memory space allocated to the process/VM/object in
question, including the usual checks for address wrap and alignment
violations. Also, the kernel must verify that no regions overlap,
other than as allowed by normal interprocess address space sharing.
Code pages should be marked read-only. However, it might make sense to
allow the data pages to be executable, which would make JIT writers
very happy and theoretically would not weaken security, on account of
the page protections already in place.
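
The checks listed above could look roughly like this. It is a sketch that
assumes 4 KB pages and describes every region as a base plus a size in
bytes. `struct region`, `region_valid` and `regions_overlap` are names
invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define NATIVE_PAGE_SIZE 4096u  /* assumption: 4 KB pages */

struct region {
    uint64_t base;  /* address of the first byte */
    uint64_t size;  /* length in bytes */
};

/* True if r is non-empty, page-aligned, does not wrap around the top
   of the address space, and lies entirely inside `owner`. */
static inline bool region_valid(struct region r, struct region owner)
{
    if (r.size == 0)
        return false;
    if (r.base % NATIVE_PAGE_SIZE != 0 || r.size % NATIVE_PAGE_SIZE != 0)
        return false;                        /* alignment violation */
    if (r.base + r.size < r.base)
        return false;                        /* address wrap */
    if (owner.base + owner.size < owner.base)
        return false;
    return r.base >= owner.base &&
           r.base + r.size <= owner.base + owner.size;
}

/* True if two already-validated regions share at least one byte. */
static inline bool regions_overlap(struct region a, struct region b)
{
    return a.base < b.base + b.size && b.base < a.base + a.size;
}
```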

Caveats:

1. Don't forget the stack. I think this is a solvable problem, but
you have to make sure that the PL3 stack can't access the PL0 stack,
while still being able to call the kernel. (Perhaps the ONLY kernel
call allowed while in "Native Mode" should be "Exit Native Mode".)

2. Evil native code can issue INTs, creating spurious hardware
activity. This should be acceptable, as in the real world, spurious
interrupts can happen due to noise. But driver-writers should be
aware of this threat, and code their IOs defensively.

>
> It is worth mentioning that the long-term (slow) GC code may live in
> a special address space as well, so we may need support for separate
> address spaces anyway.

Sadly, this doesn't help you. Why? Because if you have 100 different
native code snippets and 200 different data areas, then you still need
to enforce protection between them. You can't let Code Snippet for
Process A touch Data Area for Process B, even though they're both
Native Mode processes. So this leads to the very problem that we just
solved above, using the Global and Supervisor bits. Sorry, you need
to reload CR3.

Alternative:

Restrict the type of instructions that can execute in Native Mode. Be
so restrictive that security analysis becomes trivial. For example,
the code must keep a loop count in RCX, and the source and destination
base pointers in RSI and RDI, respectively. Also, it might have some
restrictions on branch instructions. This isn't as tyrannical as it
sounds, and it allows you to sidestep this whole disgusting problem.
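
One way to read this (an assumption about the intent, not a worked-out
design): if the only memory the code may touch is `count` elements starting
at the source and destination pointers, the kernel can bounds-check those
values once, before entry, instead of relying on page protection. The
function `native_loop_in_bounds` and its parameters are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Pre-entry check for a restricted native loop.
   count     - loop count the code will receive in RCX
   elem_size - bytes read and written per iteration
   src_len   - bytes available at the source pointer (RSI)
   dst_len   - bytes available at the destination pointer (RDI)
   Returns true only if every iteration stays inside both buffers. */
static inline bool native_loop_in_bounds(uint64_t count, uint64_t elem_size,
                                         uint64_t src_len, uint64_t dst_len)
{
    /* Reject count * elem_size overflowing 64 bits. */
    if (elem_size != 0 && count > UINT64_MAX / elem_size)
        return false;

    uint64_t total = count * elem_size;
    return total <= src_len && total <= dst_len;
}
```

This only helps if a verifier has already confirmed that the code really
does confine itself to those registers, which is the hard part.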

Believe it or not, this is my favorite alternative. Why? Because you
can start with extreme restrictions, then become more liberal as
Phantom evolves. (Just have a "Get Native Support Version Number"
kernel call.) BUT, you will NEVER have to think about CR3 hell ever
again. AND, you can provide some massive acceleration with even
highly-restricted operations. Think about it... most of the cycles in
the world are DSP-ish. They aren't branch-intensive pieces of code.
They sit there doing matrix math all day long. It's a lot of
calculations, but the algorithm is simple and stupid and repetitive.

On the other hand, if you solve this problem the CR3 way, then you'll
NEVER get away from it. You'll have this asshole legacy following you
forever, even after Phantom's VM overhead becomes trivial. (Some VMs
these days have NEGATIVE overhead, so don't be surprised if this
occurs at some point to Phantom.)

Anyway, I'm pleased that you're thinking about this. It shows that
you give a damn about performance, and also security!

Tomato

dmitry.z...@gmail.com

Mar 9, 2009, 5:15:50 PM

to Phantom OS

Thank you for a detailed review.

> Here's what I'd suggest, in the interest of maximum simplicity and
> security:

I thought about this approach and see some problems with it:

> 1. By default, on each core, mark all pages as Global (Page Table
> Entry bit 8 set) and Supervisor (User/Supervisor bit 2 cleared).
>
> 2. When native code wants to execute, it must tell the kernel the base
> and size of (a) its code and (b) its data area. These values must be
> page-granular (4KB or 2MB or whatever).

The code and data areas are supposed to be ordinary Phantom objects.
That means those objects can initially be of any size, starting from
0 bytes, and are not aligned. Though, just writing this sentence, I
found a possible solution: the actual Phantom object has to be two
objects inside. One is a 'proxy', which is available to Phantom code,
and the other is the actual byte buffer, which can be moved to a page
boundary and resized if it is used in the native environment.
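
A minimal userland sketch of that proxy idea, assuming 4 KB pages. The
struct `native_proxy` and the function `proxy_prepare_for_native` are
invented for illustration. malloc() with manual alignment stands in for
whatever page allocator the kernel would really use.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NATIVE_PAGE_SIZE 4096u  /* assumption: 4 KB pages */

/* Phantom code sees only the proxy; where the bytes live is hidden. */
struct native_proxy {
    size_t   length;    /* logical object size, any value from 0 */
    size_t   capacity;  /* bytes reserved at `bytes` */
    void    *raw;       /* pointer to hand back to free() */
    uint8_t *bytes;     /* start of the object's contents */
};

static inline size_t round_up_to_page(size_t n)
{
    return (n + NATIVE_PAGE_SIZE - 1) / NATIVE_PAGE_SIZE * NATIVE_PAGE_SIZE;
}

/* Move the contents to a page boundary and pad them to whole pages.
   Returns 0 on success, -1 if memory could not be allocated. */
static inline int proxy_prepare_for_native(struct native_proxy *p)
{
    size_t cap = round_up_to_page(p->length ? p->length : 1);
    void *raw = malloc(cap + NATIVE_PAGE_SIZE - 1);
    if (raw == NULL)
        return -1;

    /* Round the address up to the next page boundary. */
    uintptr_t aligned = ((uintptr_t)raw + NATIVE_PAGE_SIZE - 1)
                        & ~(uintptr_t)(NATIVE_PAGE_SIZE - 1);
    uint8_t *bytes = (uint8_t *)aligned;

    memset(bytes, 0, cap);                   /* no stale data in the padding */
    if (p->length != 0)
        memcpy(bytes, p->bytes, p->length);

    free(p->raw);                            /* release the old buffer */
    p->raw = raw;
    p->bytes = bytes;
    p->capacity = cap;
    return 0;
}
```

Because Phantom code only holds the proxy, the buffer can be relocated
like this without any reference in the VM having to change.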

Thank you, I believe today we made some real progress on this subject!

tomato

Mar 9, 2009, 9:43:14 PM

to Phantom OS

On Mar 9, 2:15 pm, "dmitry.zavalis...@gmail.com" <dmitry.zavalis...@gmail.com> wrote:
> The code and data areas are supposed to be ordinary Phantom objects.
> That means those objects can initially be of any size, starting from
> 0 bytes, and are not aligned. Though, just writing this sentence, I
> found a possible solution: the actual Phantom object has to be two
> objects inside. One is a 'proxy', which is available to Phantom code,
> and the other is the actual byte buffer, which can be moved to a page
> boundary and resized if it is used in the native environment.

This could work, but you need CR3, right? Otherwise, what is to stop
the code from addressing any memory location?

Tomato

Dmitry Zavalishin

Oct 24, 2011, 10:53:01 AM

to phant...@googlegroups.com

Hi,

Long time no talk. Glad to tell you that we already have a (not really ideal, but working) native subsystem in Phantom.
And Phantom is an open-source project now.

http://code.google.com/p/phantomuserland/

How about joining us?

--

dz
