routing packets from a tcpip.Stack to host network

Kōshin

Sep 23, 2024, 5:28:46 PM
to gVisor Users [Public]
Hi-

Based on the tun_tcp_echo sample (https://github.com/google/gvisor/blob/master/pkg/tcpip/sample/tun_tcp_echo/main.go), I have created a tun device and a tcpip.Endpoint that receives packets for a certain IP address and communicates back to the sender. Wonderful!

At the moment, though, the tun device is the only way for packets to get in and out of my system. I want to send packets not destined for my chosen IP out to the public internet, using my host's network interface. How, roughly speaking, do I do that?

My guess is that I need to create a second NIC with the kind of LinkEndpoint that knows how to send packets out via native Linux kernel syscalls. Which LinkEndpoint implementation would I use for that? And are there any samples showing how this could be done at the level of tcpip.Stack?

Many thanks,

Kōshin

Kevin Krakauer

Sep 23, 2024, 7:25:58 PM
to gVisor Users [Public]
Hi Kōshin,

Can you be more specific about what you mean by "packets not destined for my chosen IP out to the public internet"? It would be helpful to understand your setup in more detail.

The fdbased LinkEndpoint is typically run inside a network namespace and uses the IP address of the namespace('s device). But those packets are typically routed via iptables and come out of the host interface looking like packets from this host interface.

If you mean that you want to send packets with source address a.b.c.d from your host interface that has address w.x.y.z, I believe you could create an fdbased LinkEndpoint hooked up to an AF_PACKET socket that's been opened on the host interface. You could send packets with an arbitrary source address, and you'd receive a copy of all incoming packets.
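
For illustration, here is a minimal sketch of opening an AF_PACKET socket on a named interface using only Go's syscall package. The helper names (htons, openPacketSocket) and the use of "lo" are assumptions made for this example, not part of gVisor. It is Linux-only and normally needs root or CAP_NET_RAW. The resulting file descriptor is the kind of value that the tun_tcp_echo sample passes to fdbased.New through the FDs field of fdbased.Options.

```go
package main

import (
	"fmt"
	"net"
	"syscall"
)

// htons converts a 16-bit value to network byte order.
// Assumption: the host is little-endian (amd64, arm64).
func htons(v uint16) uint16 {
	return v<<8 | v>>8
}

// openPacketSocket opens a raw AF_PACKET socket and binds it to the named
// interface, so that reads and writes carry whole link-layer frames.
func openPacketSocket(ifaceName string) (int, error) {
	iface, err := net.InterfaceByName(ifaceName)
	if err != nil {
		return -1, fmt.Errorf("lookup %s: %w", ifaceName, err)
	}
	proto := htons(syscall.ETH_P_ALL)
	fd, err := syscall.Socket(syscall.AF_PACKET, syscall.SOCK_RAW, int(proto))
	if err != nil {
		return -1, fmt.Errorf("socket: %w", err)
	}
	sa := &syscall.SockaddrLinklayer{Protocol: proto, Ifindex: iface.Index}
	if err := syscall.Bind(fd, sa); err != nil {
		syscall.Close(fd)
		return -1, fmt.Errorf("bind: %w", err)
	}
	return fd, nil
}

func main() {
	fd, err := openPacketSocket("lo")
	if err != nil {
		fmt.Println("could not open packet socket:", err)
		return
	}
	defer syscall.Close(fd)
	fmt.Println("packet socket fd:", fd)
}
```

Because the socket is bound with ETH_P_ALL, it sees every frame on the interface, which matches the "copy of all incoming packets" behaviour described above.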

Kevin

Kevin Krakauer

Sep 24, 2024, 4:13:03 PM
to Kōshin, gVisor Users [Public]
This is getting a little complex for design-by-email, but I'll do my best :)

To answer the routing question: gVisor/runsc is usually run via Docker or Kubernetes, which takes care of routing packets between the host network interface and the device the sentry is hooked up to. This is a good high level diagram of how Docker does things: veth devices are used along with iptables rules (notably NAT rules) to isolate and also support port publishing and the like. It sounds like Docker won't work for what you want to do, although you could probably hack something together with iptables rules to sort of make it work.

So I think the system you want has these requirements (please correct what I get wrong):

- Your program, which I'll call the controller, is running as a normal process on the machine.
- The controller starts a subprocess in a network namespace. The subprocess is intentionally only network-isolated; otherwise it is a normal host process.
- You want all incoming traffic to the machine to reach the subprocess.
- You want all outgoing traffic from the subprocess to be filtered by the controller.
- You want the controller to utilize netstack to ingest/parse/filter/manipulate/forward/whatever traffic from the subprocess (just netstack, not full gVisor/runsc/sentry).
- You want all traffic from the subprocess to be filtered in this way, i.e. it never goes out the host NIC without passing through the controller.

If that's the case I might recommend something like this (enjoy the beautiful ASCII diagram):

     |            A                          B                C   host netns | D           E   subprocess netns
<--->|[host NIC]<--->[TAP/TUN or veth pair]<--->[controller]<---->[veth0]<---|--->[veth1]<--->[subprocess]
     |                                                                       |

A is a set of iptables rules that does something like this: forward all traffic (minus maybe SSH?) to a TAP/TUN device (or veth pair) whose other end the controller listens on via fdbased+AF_PACKET (B). The controller has another fdbased+AF_PACKET endpoint attached at C to a veth device whose counterpart (D) is inside the network namespace and used in E by the subprocess. In this setup every single packet goes through the controller, which can forward or filter or do whatever to it. You would probably want to use NAT at point A: if both the host and the controller are using the same IP you might get weird behavior (e.g. your controller knows how to route packets to the subprocess, but the host doesn't and it returns ICMP errors).

This can be simplified if either (1) all incoming connections are to known port(s), e.g. you only care about packets coming in on port 80, or (2) all connections are initiated by the subprocess. Then you can just implement a basic proxy.

I'm not sure how messy this would be to implement -- it would surely take experimentation and tinkering with devices/rules. Plus this is just one idea -- you can hook up virtual devices and iptables rules in all sorts of ways.

On Tue, Sep 24, 2024 at 5:51 AM Kōshin <kos...@cml.me> wrote:
Thanks Kevin, that's helpful.

> Can you be more specific about what you mean by "packets not destined for my chosen IP out to the public internet"? It would be helpful to understand your setup in more detail.

I am trying to create a program that can run another program in a new network namespace and filter packets produced by that program using rules implemented in Go. I don't want the subprocess to be isolated from the rest of the system in terms of filesystem, process ID, user ID, etc, but it should be isolated in terms of network. My approach is to create a new network namespace, create a tun device within that namespace, route everything to that tun, then exec the subprocess. The parent then receives all the packets generated by the subprocess using an fdbased LinkEndpoint on the file descriptor for the tun device, and feeds those packets into a tcpip.Stack on which I have a tcpip.Endpoint listening on one particular IP address.

The issue I'm facing is that after ingesting the packets into a tcpip.Stack, I then want to send some of them out to the host's network to be routed to the public internet (or elsewhere). I'm wondering how I can set that second part up with a tcpip.Stack: what kind of LinkEndpoint would I use for that?

> The fdbased LinkEndpoint is typically run inside a network namespace and uses the IP address of the namespace('s device).

Makes sense.

> But those packets are typically routed via iptables and come out of the host interface looking like packets from this host interface.

Interesting. When you say "those packets are typically routed via iptables", do you mean that the network device visible to the container is set up, using some combination of iptables primitives, to route via the host network interface? I thought that when using runsc, sentry would take care of routing packets to/from the container in userspace, not using iptables, or perhaps making some use of iptables but still passing all packets through userspace. What I'm thinking of here is the netstack package in sentry.

> If you mean that you want to send packets with source address a.b.c.d from your host interface that has address w.x.y.z, I believe you could create an fdbased LinkEndpoint hooked up to an AF_PACKET socket that's been opened on the host interface. You could send packets with an arbitrary source address, and you'd receive a copy of all incoming packets.

Makes sense. Now that you write this, I understand that this won't work in my case, since source address a.b.c.d won't be a public internet IP address, so return packets won't arrive at w.x.y.z. Is there any way forward with the fdbased LinkEndpoint hooked up to an AF_PACKET socket? Can I have some kind of NAT layer in userspace within tcpip.Stack? If I do that, how do I correctly send these packets out of the host interface in a way that I can identify the return packets and leave my host interface usable by other processes running on the host?

Many thanks for taking the time to help with this.

 


Kōshin

Sep 24, 2024, 8:01:25 PM
to Kevin Krakauer, gVisor Users [Public]
Ah, this is very helpful. Actually, it is the case that all connections are initiated by the subprocess. Specifically, the requirements you wrote are all correct except the following, which is not a requirement:

> - You want all incoming traffic to the machine to reach the subprocess.

On Tue, Sep 24, 2024 at 4:13 PM Kevin Krakauer <krak...@google.com> wrote:
> To answer the routing question: gVisor/runsc is usually run via Docker or Kubernetes, which takes care of routing packets between the host network interface and the device the sentry is hooked up to. This is a good high level diagram of how Docker does things: veth devices are used along with iptables rules (notably NAT rules) to isolate and also support port publishing and the like. It sounds like Docker won't work for what you want to do, although you could probably hack something together with iptables rules to sort of make it work.

I see, yes I agree.

> If that's the case I might recommend something like this (enjoy the beautiful ASCII diagram):
>
>      |            A                          B                C   host netns | D           E   subprocess netns
> <--->|[host NIC]<--->[TAP/TUN or veth pair]<--->[controller]<---->[veth0]<---|--->[veth1]<--->[subprocess]
>      |                                                                       |
>
> A here is some iptables rules that do something like: forward all traffic (minus maybe SSH?) to a TAP/TUN device (or veth pair) that at the other end is listened to by the controller via fdbased+AF_PACKET (B). That controller has another fdbased+AF_PACKET endpoint attached at C to a veth device whose counterpart (D) is inside the network namespace and used in E by the subprocess. In this setup every single packet goes through the controller, which can forward or filter or do whatever to it. You would probably want to use NAT at point A, as if both the host and the controller are using the same IP you might get weird behavior (e.g. your controller knows how to route packets to the subprocess, but the host doesn't and it returns ICMP errors).

Thank you, this is helpful. Yes, I understand the options available more clearly now.

Kevin Krakauer

Sep 25, 2024, 2:10:59 PM
to Kōshin, gVisor Users [Public]
Happy to help.

And yeah, if all connections are initiated by the subprocess and you don't need all incoming traffic to reach the subprocess, then it's simpler. The controller can just receive connections from the subprocess, then open a corresponding host socket and proxy bytes between them.

Kōshin

Nov 1, 2024, 8:10:13 PM
to Kevin Krakauer, gVisor Users [Public]
I have a pretty good proof of concept of this up and running. Here it is: https://github.com/monasticacademy/httptap. It doesn't use gVisor at all right now, but it would be much better if it did. I wrote my own rudimentary TCP implementation using gopacket to parse and marshal packets one by one, and I create a TUN device in a network namespace and proxy all traffic to/from host sockets. There is a huge amount of TCP that it doesn't implement and I'd very much like to make use of the gVisor TCP stack instead of my own hand-rolled one.

What I'm wondering is: how might I set up a gVisor tcpip stack that reads packets from an fdbased LinkEndpoint and delivers all TCP connections to a certain endpoint, regardless of destination host or port? I already have the tun device created and a file descriptor for it, and an fdbased link endpoint constructed around that file descriptor. I can listen on a certain IP address and port and successfully communicate with the subprocess, but I have not yet worked out how to intercept all traffic regardless of destination or port.

To be more specific, what I have at the moment is this:

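... create tun device, fdbased LinkEndpoint, tcpip stack with 1 NIC ...

// Assign myAddress to NIC 1.
mystack.AddProtocolAddress(1, myAddress, stack.AddressProperties{})
...
// Create a TCP endpoint; proto is the network protocol number.
ep, e := mystack.NewEndpoint(tcp.ProtocolNumber, proto, &wq)
...
// Bind to myPort on any local address, then start listening.
ep.Bind(tcpip.FullAddress{Port: myPort})
...
ep.Listen(10)

for {
    // Accept returns the connected endpoint and its own waiter queue.
    conn, wq, err := ep.Accept(nil)
    ...

    go func() {
        defer conn.Close()
        ... communicate here with conn ...
    }()
}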
 
This works for traffic already destined to myAddress:myPort, but how might I go about catching all TCP traffic going anywhere?

Thank you once again for any help whatsoever.

Kevin Krakauer

Nov 5, 2024, 12:41:14 PM
to Kōshin, gVisor Users [Public]
So you want:

[ sandboxed app ] <--> [ httptap proxy ] <--> [ some other endpoint ]

- All connections are initiated by the sandboxed app.
- All connections are proxied to the other endpoint regardless of their destination IP:port.

Thankfully we have something just for this. My recommendation would be to have httptap create an fdbased endpoint stack with an AF_PACKET socket. It would be opened on the veth that connects to the sandboxed app. You would then create a forwarder and use it to set a transport protocol handler that handles all TCP traffic. So all TCP traffic will go to the handler, and the function you pass when creating the forwarder will do the connection proxying.

The GitHub repo doesn't have an example use, but in practice it looks something like this:

tcpForwarder := tcp.NewForwarder(myStack, 0, maxRequests, myHandler)
myStack.SetTransportProtocolHandler(tcp.ProtocolNumber, tcpForwarder.HandlePacket)