How does gVisor trap to the syscall handler on the KVM platform?

3n4...@gmail.com

May 3, 2018, 5:49:23 AM
to gVisor Users
Hello all,
I'm curious about the technical details of how gVisor traps to its syscall handler using VMX.
I'd be grateful if you could also point out the source files and functions that implement this.

Thanks!


Michael Pratt

May 9, 2018, 4:23:20 PM
to 3n4...@gmail.com, gvisor...@googlegroups.com
On the KVM platform, system call interception works much like a normal OS. When running in guest mode, the platform sets MSR_LSTAR to a system call handler (https://github.com/google/gvisor/blob/master/pkg/sentry/platform/ring0/entry_amd64.go#L23-L32), which is invoked whenever an application (or the sentry itself) executes a SYSCALL instruction.

System calls from the sentry to the host are a bit more involved, as they require the sentry to switch from guest mode back to host mode before calling into the host kernel.
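To make the two cases above concrete, here is a toy Go model, not gVisor code. It assumes a single handler installed in an MSR_LSTAR-like slot and a caller ring passed in explicitly; the names `cpu`, `handler`, and `sentryEntry` are hypothetical and exist only for this sketch.

```go
package main

import "fmt"

// ring is the privilege level the SYSCALL instruction was executed at.
type ring int

const (
	ring0 ring = 0 // the sentry itself, running in guest mode
	ring3 ring = 3 // the sandboxed application
)

// handler plays the role of the entry point whose address is stored in
// MSR_LSTAR: every SYSCALL executed in the guest lands here.
type handler func(from ring, sysno int) string

// cpu is a toy stand-in for a guest vCPU (hypothetical type).
type cpu struct {
	lstar handler
}

// syscall models executing a SYSCALL instruction at the given ring.
func (c *cpu) syscall(from ring, sysno int) string {
	return c.lstar(from, sysno)
}

// sentryEntry is a hypothetical handler: application syscalls are
// emulated by the sentry, while the sentry's own syscalls must leave
// guest mode so that the host kernel can service them.
func sentryEntry(from ring, sysno int) string {
	if from == ring3 {
		return fmt.Sprintf("sentry emulates app syscall %d", sysno)
	}
	return fmt.Sprintf("HLT, exit to host, host runs syscall %d", sysno)
}

func main() {
	c := &cpu{lstar: sentryEntry}
	fmt.Println(c.syscall(ring3, 0))   // 0 is read(2) on x86-64
	fmt.Println(c.syscall(ring0, 202)) // 202 is futex(2) on x86-64
}
```

The point of the sketch is only that one entry point serves both callers and branches on where the call came from; the real handler is assembly and does considerably more.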

--
You received this message because you are subscribed to the Google Groups "gVisor Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gvisor-users...@googlegroups.com.
To post to this group, send email to gvisor...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gvisor-users/fcd51d0b-3925-45ca-ab36-6e1049b25a47%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Xiang Li

May 9, 2018, 5:27:57 PM
to gVisor Users
Can you explain this a little bit more? When and why would sentry issue syscall to the host kernel when it is in the guest mode?

Michael Pratt

May 9, 2018, 6:43:55 PM
to xiang...@gmail.com, gvisor...@googlegroups.com
The sentry is developed as a normal user-space application (see "How is gVisor different from other container isolation mechanisms?", https://github.com/google/gvisor#how-is-gvisor-different-from-other-container-isolation-mechanisms, and the following Architecture section of our README). As such, it may make host system calls for many different reasons. For example, external file system access performs read()s and write()s to a 9p server over a Unix domain socket. The Go runtime itself uses clone(), futex(), and mmap() (among others) for host thread creation, synchronization primitives, and memory allocation, respectively.

The vast majority of sentry code (anything outside of pkg/sentry/platform/kvm or pkg/sentry/platform/ring0) assumes that it is a normal Linux process. Those packages are responsible for ensuring that interactions with the host (syscalls) still work properly.

 
 


Xiang Li

May 10, 2018, 2:23:30 AM
to Michael Pratt, gvisor...@googlegroups.com

Thanks for the explanation. I understand the high level idea, but I am not clear how exactly it works.

So does the overall architecture look like this?

Ring 3    User App         |     Sentry
------------------------------------------------    guest
Ring 0                Sentry.ring0

///////////////////////////////////////////////////////////////////////

Ring 3                Sentry.kvm_platform   host

Is it correct that when the user app makes a syscall, it is first intercepted by the sentry at ring 0 in the guest, and then actually handled by the Sentry emulator running at ring 3 in the guest? And if the Sentry emulator makes a syscall or needs some resources, it switches to the host and is handled by the host Linux?

 

Michael Pratt

May 10, 2018, 4:59:12 PM
to Xiang Li, gvisor...@googlegroups.com
Almost, except in guest mode, the sentry always executes in ring 0. You can see the core flow here: https://github.com/google/gvisor/blob/master/pkg/sentry/platform/ring0/kernel_amd64.go#L215-L231

The sentry is normally mapped at a normal userspace address which cannot be mapped into application address spaces (since it would conflict with application mappings). So there is a sentry page table with the normal mappings, plus a mirror of relevant sentry mappings in the kernel range (bit 63 set) in all application page tables. This mirrored copy is what executes between jumpToKernel() and jumpToUser().
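The mirroring described above can be illustrated with a little address arithmetic. This is a sketch under an assumption: 48-bit canonical addressing with the upper half starting at 0xffff800000000000. The constant and the helper names are illustrative, not taken from the gVisor source.

```go
package main

import "fmt"

// kernelStart is the assumed start of the upper ("kernel") half of a
// 48-bit canonical address space. It is an assumption for this sketch;
// check the ring0 package for the value gVisor actually uses.
const kernelStart uint64 = 0xffff800000000000

// kernelAlias returns the upper-half mirror of a lower-half address.
func kernelAlias(addr uint64) uint64 { return addr | kernelStart }

// userAlias returns the lower-half address for an upper-half mirror.
func userAlias(addr uint64) uint64 { return addr &^ kernelStart }

// isKernelAddr reports whether bit 63 of the address is set.
func isKernelAddr(addr uint64) bool { return addr>>63 == 1 }

func main() {
	lo := uint64(0x00007fffdeadb000) // a typical user-space address
	hi := kernelAlias(lo)
	fmt.Printf("%#x -> %#x (kernel range: %v)\n", lo, hi, isKernelAddr(hi))
	fmt.Printf("%#x -> %#x (kernel range: %v)\n", hi, userAlias(hi), isKernelAddr(userAlias(hi)))
}
```

Because the two addresses differ only in their top bits, the same code can be reached at either alias depending on which page table is active.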

iret()/sysret() save RSP/RBP so that the syscall handler (sysenter()) can restore them and then "return" to the call site in SwitchToUser.

The full execution path looks like:
kernel.runApp.execute -> kernel.Task.p.Switch (kvm.context.Switch) -> kvm.vCPU.SwitchToUser -> ring0.CPU.SwitchToUser

kernel.runApp is part of the core task lifecycle state machine which handles application syscalls (eventually calling one of the handlers, https://github.com/google/gvisor/blob/master/pkg/sentry/syscalls/linux/linux64.go#L48). The kernel package is independent of the execution platform.

Xiang Li

May 10, 2018, 8:32:00 PM
to Michael Pratt, gvisor...@googlegroups.com
Thanks for the detailed explanation.

> The sentry is normally mapped at a normal userspace address

Is this because of the fact that sentry is developed as a normal user-space application with go runtime? If the sentry runs in ring0 with the normal userspace address, how would its own syscalls (either from go runtime or to access host resources) get trapped? Is it handled here (https://github.com/google/gvisor/blob/master/pkg/sentry/platform/ring0/entry_amd64.s#L163-L221)? It seems that CPU_KERNEL_SYSCALL is a HLT instruction for vm exit?

Michael Pratt

May 10, 2018, 8:58:57 PM
to Xiang Li, gvisor...@googlegroups.com
On Thu, May 10, 2018 at 5:32 PM Xiang Li <xiang...@gmail.com> wrote:
> Is this because of the fact that sentry is developed as a normal user-space application with go runtime?

Yup.

> If the sentry runs in ring0 with the normal userspace address, how would its own syscalls (either from go runtime or to access host resources) get trapped? Is it handled here (https://github.com/google/gvisor/blob/master/pkg/sentry/platform/ring0/entry_amd64.s#L163-L221)? It seems that CPU_KERNEL_SYSCALL is a HLT instruction for vm exit?

Yup, that's correct. Note that the SYSCALL instruction works just fine from ring 0; it just doesn't perform a ring switch, since you're already in ring 0. The syscall handler executes a HLT, which is the trigger to switch back to host mode.

To see the host/guest transition internals, take a look at bluepill() (switch to guest mode) and redpill() (switch to host mode) in platform/kvm. https://github.com/google/gvisor/blob/master/pkg/sentry/platform/kvm/machine_amd64.go#L110 is probably an interesting starting point.

The control flow is a bit hard to follow. At a high level it goes: bluepill() -> execute CLI (allowed if already in guest mode, or ...) -> SIGILL signal handler -> bluepillHandler() -> KVM_RUN with RIP @ CLI instruction -> execute CLI in guest mode, bluepill() returns
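The flow above can be mimicked with a small state machine. The following Go sketch is a toy model only: it records the steps as strings instead of executing privileged instructions or calling KVM, the `vcpu` type is hypothetical, and the signal name simply follows the description in this thread.

```go
package main

import "fmt"

type mode int

const (
	hostMode mode = iota
	guestMode
)

// vcpu is a toy stand-in for a vCPU: it tracks the current mode and a
// trace of the steps taken (hypothetical type, for illustration only).
type vcpu struct {
	mode  mode
	trace []string
}

// bluepill models "make sure we are in guest mode". CLI is privileged:
// in guest ring 0 it simply executes; in host ring 3 it faults, and the
// signal handler enters the guest at the same instruction.
func (c *vcpu) bluepill() {
	c.trace = append(c.trace, "CLI")
	if c.mode == guestMode {
		return // already in guest mode: CLI is allowed
	}
	// Signal name as given in the thread.
	c.trace = append(c.trace, "SIGILL", "bluepillHandler", "KVM_RUN")
	c.mode = guestMode
	c.trace = append(c.trace, "CLI") // re-executed, now in guest mode
}

// redpill models "return to host mode": the guest ends up executing
// HLT, which causes a VM exit back to the host.
func (c *vcpu) redpill() {
	if c.mode == hostMode {
		return // already in host mode: nothing to do
	}
	c.trace = append(c.trace, "HLT", "VM exit")
	c.mode = hostMode
}

func main() {
	c := &vcpu{mode: hostMode}
	c.bluepill() // host -> guest via the fault path
	c.bluepill() // no-op apart from the CLI itself
	c.redpill()  // guest -> host
	fmt.Println(c.trace)
}
```

The notable design point is that the same call is safe in either mode: the CLI instruction doubles as a probe and as the resume point once KVM_RUN starts the guest.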

 

Xiang Li

May 11, 2018, 2:39:00 AM
to gVisor Users
Thanks for the hints, they are very useful for understanding the code. Very interesting.

This reminds me of Dune, but without a kernel module involved. The "dune lib/process" is a Linux emulator, and the untrusted code in ring 3 is the user application. I also read a little bit of the memory management part, and am wondering if gVisor is also similar to Dune but implements Guest Physical -> Host Virtual with a software approach?

Michael Pratt

May 11, 2018, 1:35:59 PM
to Xiang Li, gvisor...@googlegroups.com
There are some similarities to Dune, in particular the process-level virtualization model where the guest still utilizes Linux system calls as its interface to the host.

I don't quite follow your question about memory management. Core gVisor represents "physical memory" via host file descriptors (see pkg/sentry/platform.Memory, File) with effectively no size limitations. The KVM platform still uses EPT/NPT under the hood, though those are hidden behind KVM APIs.

 

Qixuan Wu

May 20, 2018, 11:04:30 PM
to gVisor Users
Based on your discussion, the Sentry kernel runs in ring 0 of guest mode, so the picture should look like this, right?

Ring 3                User App             
------------------------------------------------    guest
Ring 0                Sentry

///////////////////////////////////////////////////////////////////////

Ring 3                Sentry.kvm_platform   host

Qixuan Wu

May 21, 2018, 6:03:07 AM
to gVisor Users
And if the sentry is running in ring 0, why is it called a user-space kernel?

Samuel Ortiz

May 21, 2018, 7:02:11 AM
to Qixuan Wu, gVisor Users
AFAIU the sentry kernel runs on ring0 VMX non-root, which is host's
ring3/user space.


Michael Pratt

May 21, 2018, 12:43:41 PM
to samuel...@intel.com, wuqi...@gmail.com, gvisor...@googlegroups.com

On Mon, May 21, 2018 at 4:02 AM Samuel Ortiz <samuel...@intel.com> wrote:
> AFAIU the sentry kernel runs on ring0 VMX non-root, which is host's
> ring3/user space.

Most importantly, gVisor (in either ptrace or kvm mode) depends on normal system calls to the host kernel like any other user-space program, rather than virtualized block devices, NICs, etc like a standard virtual machine. KVM provides access to Intel VMX/AMD SVM hardware virtualization features which are used primarily for fast system call interception. We don't use KVM to build a full "virtual machine" with other virtual hardware like a standard VMM.

On Mon, May 21, 2018 at 03:03:06AM -0700, Qixuan Wu wrote:
> And if sentry is running on ring0, why they are called user-space kernel. ?
>
> On Monday, May 21, 2018 at 11:04:30 AM UTC+8, Qixuan Wu wrote:
> >
> > As per your discussion, Sentry kernel is running in the ring0 of guest
> > mode, so the picture should be like this, right ?
> >
> > Ring 3                User App             
> > ------------------------------------------------    guest
> > Ring 0                Sentry
> >
> > ///////////////////////////////////////////////////////////////////////
> >
> > Ring 3                Sentry.kvm_platform   host

Yes, this is more correct. However, it should be noted that host -> guest and guest -> host transitions only occur as necessary (app execution must occur in guest mode, host syscalls must occur in host mode). Otherwise, the sentry remains in whichever mode it happens to be in as long as possible. This means that most sentry operations (e.g., handling a read from tmpfs) may occur in either host or guest mode.
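The "only switch when necessary" rule can be sketched as a tiny simulation. This is an illustrative Go model, not gVisor code; the operation kinds and `countTransitions` are hypothetical names. Application execution forces guest mode, host syscalls force host mode, and ordinary sentry work stays in whichever mode is current.

```go
package main

import "fmt"

type mode int

const (
	hostMode mode = iota
	guestMode
)

// op is a kind of work the sentry may perform (hypothetical categories).
type op int

const (
	appExec     op = iota // must run in guest mode
	hostSyscall           // must run in host mode
	sentryWork            // runs in whichever mode is current
)

// countTransitions walks the operations in order and counts how many
// host<->guest switches are needed when switching only on demand.
func countTransitions(start mode, ops []op) int {
	cur, n := start, 0
	for _, o := range ops {
		want := cur
		switch o {
		case appExec:
			want = guestMode
		case hostSyscall:
			want = hostMode
		}
		if want != cur {
			n++
			cur = want
		}
	}
	return n
}

func main() {
	ops := []op{sentryWork, appExec, sentryWork, sentryWork, hostSyscall, sentryWork, appExec}
	fmt.Println("transitions:", countTransitions(hostMode, ops))
}
```

In this model the three sentryWork steps in the middle never cause a switch, which is the property the paragraph above describes.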

 

Qixuan Wu

May 21, 2018, 11:51:39 PM
to gVisor Users

On Tuesday, May 22, 2018 at 12:43:41 AM UTC+8, Michael Pratt wrote:
> Most importantly, gVisor (in either ptrace or kvm mode) depends on normal system calls to the host kernel like any other user-space program, rather than virtualized block devices, NICs, etc like a standard virtual machine. [...]

OK, I get it. So that's why people say it's a user-space kernel.

> [...] This means that most sentry operations (e.g., handling a read from tmpfs) may occur in either host or guest mode.

OK, I get it. The sentry doesn't care about, and doesn't know, which mode it is running in.

Xinyang Ge

Sep 14, 2023, 12:44:30 AM
to gVisor Users [Public]
> Most importantly, gVisor (in either ptrace or kvm mode) depends on normal system calls to the host kernel like any other user-space program

In KVM mode, would the system call made by Sentry (in the guest ring0) be translated by the Sentry's VMM code in the host ring3?  For example, if the Go runtime in the guest ring0 makes a clone() system call, I would imagine Sentry's VMM code in the host ring3 to make KVM ioctl to create a new vCPU to mimic the effect of a new thread in guest ring0, instead of passing the original clone() to the host kernel unmodified.  Is my understanding correct here?

Xinyang

Xinyang Ge

Sep 14, 2023, 1:31:14 AM
to gVisor Users [Public]
> Yes, this is more correct. However, it should be noted that host -> guest and guest -> host transitions only occur as necessary (app execution must occur in guest mode, host syscalls must occur in host mode). Otherwise, the sentry remains in whichever mode it happens to be in as long as possible. This means that most sentry operations (e.g., handling a read from tmpfs) may occur in either host or guest mode.

Also, is the sentry code along with the Go runtime mapped in both guest ring0 and host ring3, and the execution of the sentry code can happen in either guest ring0 or host ring3?  If so, how does it cope with the pointer differences (e.g., user-mode pointer has bit 63 cleared while kernel-mode pointer has bit 63 set)?

Xinyang Ge

Sep 15, 2023, 11:13:30 AM
to gVisor Users [Public]
I think I know how it works after reading the source code:

The sentry in host ring 3 maps itself into the guest by reading its own memory map at /proc/self/maps and calling KVM's set user memory region ioctl. The sentry maps itself at two aliased guest virtual addresses (GVAs): (1) the same address as on the host, and (2) a high kernel-like address (0x7fff -> 0xffff). (2) is mapped into the sandboxed app's guest page table, and (1) is in the sentry's own guest page table (to avoid GVA conflicts with the app).

When the app makes a system call, it first goes to the sentry's system call handler at (2), which immediately switches CR3 to the sentry's page table and jumps to the same instruction at the aliased lower address (0x7fff…) while still running in guest ring 0. This makes the sentry's execution no different from running in host ring 3, until it needs to make a real system call to the host.

In that case, the syscall handler detects that the call was made from guest ring 0 and executes HLT to VM-exit to the host. The sentry in host ring 3 receives control and syncs its own register context from the vCPU's register state to retry exactly the same SYSCALL instruction, this time in host ring 3. Now everything can continue.

Once the sentry in host ring 3 finishes its emulation and is ready to go back to the sandboxed app, it probes whether it is in guest mode by executing a CLI instruction. CLI is a privileged instruction only available in ring 0, so the sentry in host ring 3 receives a SIGILL, and the signal handler syncs the host register state back to the vCPU's register context and calls KVM's vCPU run ioctl to switch to guest ring 0. The sentry then starts executing in guest ring 0 with RIP pointing at the same CLI instruction. When it needs to sysret to the sandboxed app, it jumps to the aliased high address (0xffff…), switches CR3 to the sandboxed app's page table, and finally executes the sysret instruction to resume the sandbox.

Please let me know if there's any inaccuracy in my understanding.
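As a small illustration of the first step in that description (reading /proc/self/maps), here is a hedged Go sketch that parses one maps line into an address range. The `region` type and `parseMapsLine` are hypothetical helpers, not gVisor functions, and the sample line is made up. Registering each range with KVM (the KVM_SET_USER_MEMORY_REGION ioctl) is deliberately left out.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// region is one mapping from /proc/self/maps (hypothetical type).
type region struct {
	start, end uint64
	perms      string
}

// parseMapsLine parses a line such as
//   7f1c2a000000-7f1c2a021000 rw-p 00000000 00:00 0
// and returns its address range and permission string.
func parseMapsLine(line string) (region, error) {
	fields := strings.Fields(line)
	if len(fields) < 2 {
		return region{}, fmt.Errorf("malformed maps line: %q", line)
	}
	lo, hi, ok := strings.Cut(fields[0], "-")
	if !ok {
		return region{}, fmt.Errorf("malformed address range: %q", fields[0])
	}
	start, err := strconv.ParseUint(lo, 16, 64)
	if err != nil {
		return region{}, err
	}
	end, err := strconv.ParseUint(hi, 16, 64)
	if err != nil {
		return region{}, err
	}
	return region{start: start, end: end, perms: fields[1]}, nil
}

func main() {
	// Sample line with made-up addresses.
	r, err := parseMapsLine("7f1c2a000000-7f1c2a021000 rw-p 00000000 00:00 0")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// A real implementation would now hand [start, end) to KVM as a
	// memory slot; this sketch only prints the parsed range.
	fmt.Printf("%#x-%#x %s (%d bytes)\n", r.start, r.end, r.perms, r.end-r.start)
}
```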