syscall overhead, blocking, and file descriptors


James Bardin

Jan 30, 2014, 4:14:52 PM
to golan...@googlegroups.com

I'm working on a TCP proxy, and wanted to benchmark using a pipe and the Linux splice syscall to avoid the userspace copy of the data.
It seems to work, in that at least I can match the performance of a simple io.Copy (until I set up a better test rig).

The docs say that pulling the fd out of a net.Conn sets the underlying file into blocking mode, and I know that syscalls can be scheduled on separate threads.
Does this mean that every splice loop will probably end up on its own OS thread?
Even if everything has its own thread, am I going to be fighting the runtime scheduler in any other ways?

If it is a problem, is there any way it could be written in a non-blocking manner in C and still cooperate with the scheduler? (Not that I've proven threads would actually be a problem yet.)
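
For concreteness, the core of the experiment looks roughly like this (a minimal sketch, assuming Linux; the function name and the 64k chunk size are illustrative choices, not my actual code):

package proxy

import (
	"net"
	"syscall"
)

// spliceLoop is a sketch, not tested code: copy from src to dst through
// a pipe so the payload never passes through user space. Calling File()
// dups the descriptor and puts the socket into blocking mode, as the
// docs warn.
func spliceLoop(dst, src *net.TCPConn) error {
	srcF, err := src.File()
	if err != nil {
		return err
	}
	defer srcF.Close()
	dstF, err := dst.File()
	if err != nil {
		return err
	}
	defer dstF.Close()

	// The pipe sits between the two sockets.
	var p [2]int
	if err := syscall.Pipe(p[:]); err != nil {
		return err
	}
	defer syscall.Close(p[0])
	defer syscall.Close(p[1])

	const chunk = 64 << 10 // 64k, an illustrative size
	for {
		// socket -> pipe
		n, err := syscall.Splice(int(srcF.Fd()), nil, p[1], nil, chunk, 0)
		if n == 0 || err != nil {
			return err // EOF (n == 0) or error
		}
		// pipe -> socket: drain everything just moved in
		for n > 0 {
			m, err := syscall.Splice(p[0], nil, int(dstF.Fd()), nil, int(n), 0)
			if err != nil {
				return err
			}
			n -= m
		}
	}
}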

Dave Cheney

Jan 30, 2014, 5:05:53 PM
to James Bardin, golan...@googlegroups.com


On 31 Jan 2014, at 8:14, James Bardin <j.ba...@gmail.com> wrote:


I'm working on a TCP proxy, and wanted to benchmark using a pipe and the Linux splice syscall to avoid the userspace copy of the data.
It seems to work, in that at least I can match the performance of a simple io.Copy (until I set up a better test rig).

Just matching sounds like a lot of work for no gain.


The docs say that pulling the fd out of a net.Conn sets the underlying file into blocking mode, and I know that syscalls can be scheduled on separate threads.
Does this mean that every splice loop will probably end up on its own OS thread?

Yes, but this has nothing to do with the blocking mode of the socket. All syscalls block the thread servicing the goroutine, and a new thread must be found (or created) to continue execution of other goroutines while the syscall is in progress.
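
You can actually watch this happen. Here is a toy sketch (an illustrative example of mine, not anything in the runtime docs) that parks goroutines in blocking pipe reads and compares the threadcreate profile before and after:

package main

import (
	"fmt"
	"runtime/pprof"
	"syscall"
	"time"
)

// Each goroutine below blocks an OS thread in a read syscall on a pipe
// that is never written to, so the runtime has to find or create other
// threads to keep everything else running.
func main() {
	before := pprof.Lookup("threadcreate").Count()
	for i := 0; i < 20; i++ {
		go func() {
			var p [2]int
			if err := syscall.Pipe(p[:]); err != nil {
				return
			}
			buf := make([]byte, 1)
			syscall.Read(p[0], buf) // blocks this thread indefinitely
		}()
	}
	time.Sleep(time.Second) // give the runtime time to react
	after := pprof.Lookup("threadcreate").Count()
	fmt.Printf("threads created: %d before, %d after\n", before, after)
}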

Even if everything has its own thread, am I going to be fighting the runtime scheduler in any other ways?

If you want to use splice(2), probably.


If it is a problem, is there any way it could be written in a non-blocking manner in C and still cooperate with the scheduler? (Not that I've proven threads would actually be a problem yet.)

Use io.Copy. If that provides insufficient performance, profile it and then look for solutions. I'm concerned you've started eating the elephant from the wrong end.
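
For comparison, the io.Copy version of a bidirectional proxy is only a few lines (a rough sketch; the function name and the minimal error handling are illustrative):

package proxy

import (
	"io"
	"net"
)

// proxyConn is a rough sketch: shuttle bytes in both directions and
// return once both directions are done. io.Copy goes through the
// runtime's netpoller, so nothing here ties up an OS thread.
func proxyConn(a, b *net.TCPConn) {
	done := make(chan struct{}, 2)
	copyHalf := func(dst, src *net.TCPConn) {
		io.Copy(dst, src)
		dst.CloseWrite() // propagate EOF to the other side
		done <- struct{}{}
	}
	go copyHalf(a, b)
	go copyHalf(b, a)
	<-done
	<-done
}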



James Bardin

Jan 30, 2014, 5:35:16 PM
to golan...@googlegroups.com, James Bardin


On Thursday, January 30, 2014 5:05:53 PM UTC-5, Dave Cheney wrote:


On 31 Jan 2014, at 8:14, James Bardin <j.ba...@gmail.com> wrote:


I'm working on a TCP proxy, and wanted to benchmark using a pipe and the Linux splice syscall to avoid the userspace copy of the data.
It seems to work, in that at least I can match the performance of a simple io.Copy (until I set up a better test rig).

Just matching sounds like a lot of work for no gain.


Well, it was a POC. I wasn't expecting better, since I'm fairly certain the VM I was using can't really make use of it (IIRC, you need DMA access to the NIC to splice more than one frame at a time).
 

Use io.Copy. If that provides insufficient performance, profile it and then look for solutions. I'm concerned you've started eating the elephant from the wrong end.


Nah, this was just a quick experiment. Nowadays I don't really have access to high-performance physical hardware, so I can't really test this out very well anyway.

Dmitry Vyukov

Jan 31, 2014, 12:06:55 AM
to James Bardin, golang-nuts
On Fri, Jan 31, 2014 at 1:14 AM, James Bardin <j.ba...@gmail.com> wrote:
>
> I'm working on a TCP proxy, and wanted to benchmark using a pipe and the
> Linux splice syscall to avoid the userspace copy of the data.
> It seems to work, in that at least I can match the performance of a simple
> io.Copy (until I set up a better test rig).
>
> The docs say that pulling the fd out of a net.Conn sets the underlying file
> into blocking mode, and I know that syscalls can be scheduled on separate
> threads.
> Does this mean that every splice loop will probably end up on its own OS
> thread?
> Even if everything has its own thread, am I going to be fighting the runtime
> scheduler in any other ways?

It depends at least on what else the program is doing and on the size
of the data. If you splice 10GB of data, then I think it must be
faster with splice; if you splice 100 bytes, then do it in user space.


> If it is a problem, is there any way it could be written in a non-blocking
> manner in C and still cooperate with the scheduler? (Not that I've proven
> threads would actually be a problem yet.)

I do not understand: what do you want to do in C? And why?

James Bardin

Jan 31, 2014, 9:53:28 AM
to Dmitry Vyukov, golang-nuts
On Fri, Jan 31, 2014 at 12:06 AM, Dmitry Vyukov <dvy...@google.com> wrote:
It depends at least on what else the program is doing and on the size
of the data. If you splice 10GB of data, then I think it must be
faster with splice; if you splice 100 bytes, then do it in user space.


Usually I'd only be aiming to splice 64k or so (larger pipes would be fine, but the architecture in Go would make it hard to share them between a large number of transfers).

What you do get (especially with kernels later than 3.5, and with NICs that support offload) is a zero-copy transfer between sockets. This usually doesn't buy you much until you're running at 10GbE speeds, and can still be difficult to tune, but there are definitely performance gains to be had.

 

> If it is a problem, is there any way it could be written in a non-blocking
> manner in C and still cooperate with the scheduler? (Not that I've proven
> threads would actually be a problem yet.)

I do not understand what you want to do in C? And why?


I was thinking that if the Syscall6 calls had significant overhead, I could write the transfer loop in C so that the scheduler is only hit with one entry point. This still leaves me with one OS thread per transfer, which might be as good as I can get with Go.

The other idea I just had was that I could write a proxy "server" backend in C, running in a separate thread, and send it pairs of FDs to splice together.

I was just throwing this out there to see if anyone had any insights in this area. It's not something I have the resources to work on now; only curious what might be possible.

And yes, io.Copy is going to be just fine in the meantime :)
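
For what it's worth, the Go side of that hand-off would presumably use SCM_RIGHTS over a Unix socket, something like this sketch (the one-byte message and the helper's protocol here are entirely hypothetical):

package fdpass

import (
	"net"
	"syscall"
)

// sendPair is a sketch of the hand-off: pass both socket descriptors to
// a helper process over a Unix socket using SCM_RIGHTS; the helper would
// run the splice loop itself.
func sendPair(ctl *net.UnixConn, a, b *net.TCPConn) error {
	fa, err := a.File()
	if err != nil {
		return err
	}
	defer fa.Close() // safe to close locally; SCM_RIGHTS dups the fd into the receiver
	fb, err := b.File()
	if err != nil {
		return err
	}
	defer fb.Close()

	rights := syscall.UnixRights(int(fa.Fd()), int(fb.Fd()))
	_, _, err = ctl.WriteMsgUnix([]byte{0}, rights, nil)
	return err
}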

Dmitry Vyukov

Jan 31, 2014, 10:02:41 AM
to James Bardin, golang-nuts
On Fri, Jan 31, 2014 at 6:53 PM, James Bardin <j.ba...@gmail.com> wrote:
>
> On Fri, Jan 31, 2014 at 12:06 AM, Dmitry Vyukov <dvy...@google.com> wrote:
>>
>> It depends at least on what else the program is doing and on the size
>> of the data. If you splice 10GB of data, then I think it must be
>> faster with splice; if you splice 100 bytes, then do it in user space.
>>
>
> Usually I'd only be aiming to splice 64k or so (larger pipes would be fine,
> but the architecture in Go would make it hard to share them between a large
> number of transfers).
>
> What you do get (especially with kernels later than 3.5, and with NICs that
> support offload) is a zero-copy transfer between sockets. This usually
> doesn't buy you much until you're running at 10GbE speeds, and can still be
> difficult to tune, but there are definitely performance gains to be had.
>
>
>>
>>
>> > If it is a problem, is there any way it could be written in a non-blocking
>> > manner in C and still cooperate with the scheduler? (Not that I've proven
>> > threads would actually be a problem yet.)
>>
>> I do not understand what you want to do in C? And why?
>
>
>
> I was thinking that if the Syscall6 calls had significant overhead,

They do not. Non-blocking read/write syscalls must be handled very efficiently.

James Bardin

Jan 31, 2014, 10:06:22 AM
to Dmitry Vyukov, golang-nuts

On Fri, Jan 31, 2014 at 10:02 AM, Dmitry Vyukov <dvy...@google.com> wrote:
> I was thinking that if the Syscall6 calls had significant overhead,

They do not. Non-blocking read/write syscalls must be handled very efficiently.

Good to know, and makes sense since I was able to match io.Copy without much effort.

swet...@frotz.net

Feb 17, 2015, 6:23:36 PM
to golan...@googlegroups.com, j.ba...@gmail.com
On Friday, January 31, 2014 at 7:02:41 AM UTC-8, Dmitry Vyukov wrote:
On Fri, Jan 31, 2014 at 6:53 PM, James Bardin <j.ba...@gmail.com> wrote:
>
>
> I was thinking that if the Syscall6 calls had significant overhead,

They do not. Non-blocking read/write syscalls must be handled very efficiently.

I was poking through the call path down through Syscall() and, I dunno, it sure seems like an awful lot of code being executed just to do (in this case) a write().

I thought perhaps the "native" I/O primitives would be more efficient, but os.(*File).Write() just adds two more layers on top of the same call path...

http://pastebin.com/jpjYP2AP

Obviously not all of that is executed (some bits are error paths, etc.), but eyeballing it, it looks like at least 200-300 instructions and quite a bit of memory traffic per syscall.

Not the end of the world, especially on 3+GHz high-end desktop/server machines, but it's not nothing. Given that cgo call shims seem to incur similar overhead, it certainly makes me want to avoid any C interface that needs a lot of calls to do its work, because this stuff does add up.
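
A quick way to put numbers on it is a pair of benchmarks writing one byte to /dev/null through syscall.Write and through os.(*File).Write (a sketch; Linux assumed, and the absolute numbers will vary by machine and Go version):

package overhead

import (
	"os"
	"syscall"
	"testing"
)

func BenchmarkSyscallWrite(b *testing.B) {
	fd, err := syscall.Open("/dev/null", syscall.O_WRONLY, 0)
	if err != nil {
		b.Fatal(err)
	}
	defer syscall.Close(fd)
	buf := []byte{0}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		syscall.Write(fd, buf) // one raw syscall per iteration
	}
}

func BenchmarkFileWrite(b *testing.B) {
	f, err := os.OpenFile("/dev/null", os.O_WRONLY, 0)
	if err != nil {
		b.Fatal(err)
	}
	defer f.Close()
	buf := []byte{0}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		f.Write(buf) // the same write through the os.File layers
	}
}

Run with "go test -bench .", the difference between the two is the cost of those extra layers.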
