Anyone compared C/C++ sockets to Java NIO?

Viewed 948 times

ymo

Feb 18, 2015, 10:50:54
to mechanica...@googlegroups.com
Were you ever forced to use native sockets instead of NIO because of performance? I am addressing this to people in HFT in particular.

Jimmy Jia

Feb 18, 2015, 11:02:45
to mechanica...@googlegroups.com

The snarky answer is that if you care about performance in that context, you probably shouldn't be using kernel sockets anyway, and should at least be using a shim for stack bypass.



Martin Thompson

Feb 18, 2015, 13:40:05
to mechanica...@googlegroups.com
Not so snarky ;-) Actually a good point. If you are using Java on Linux and need low-latency comms, then Solarflare with OpenOnload, or Mellanox, are good options.

A really key technique for reducing latency is to batch up expensive operations to amortise their cost; consider micro-burst scenarios. A great example is using sendmmsg() and recvmmsg(): if you can batch two or more frames per system call, you can beat going via a user-space stack.

The next logical step is to use JNI to call a native API for a user-space stack, with batch semantics.

The main advantage of the likes of Onload is the avoidance of kernel-introduced jitter, on top of its faster path.

Vitaly Davidovich

Feb 18, 2015, 14:01:56
to mechanica...@googlegroups.com
I don't recall if this was already posted to this list before, but either way, here's an interesting presentation (linked to from the LWN article) on the topic of the kernel network stack: http://lwn.net/Articles/629155/

Martin Thompson

Feb 18, 2015, 14:47:05
to mechanica...@googlegroups.com
Yeah it is a good article, I think you or someone else posted it before. API/protocols sooooooo need to go async and support batching if we are to take further steps in performance.

Vitaly Davidovich

Feb 18, 2015, 14:52:06
to mechanica...@googlegroups.com
Yeah, could've been me :)

+1 on batching in particular.  In fact, there was a follow-on talk at LCA 2015 on the kernel memory manager (http://lwn.net/Articles/629152/ -- pretty sure this one hasn't been posted :)) which is also looking at adding batching to kernel memory-allocation routines (partly driven by networking-stack demands).

Jimmy Jia

Feb 18, 2015, 15:04:33
to mechanica...@googlegroups.com
Martin/Vitaly:

Batching is something of a latency/throughput tradeoff, no? If you batch the send, you might get better latency for the 2nd message, but worse latency for the 1st message. It really then depends on your use case.

But in an HFT context, the end point of optimizing for latency doesn't even involve the CPU, let alone the JVM. So to go back to ymo's question, forced to do for what? Even the kernel bypass shims aren't "free" to use from a developer time perspective, if you care about your code actually working.

Vitaly Davidovich

Feb 18, 2015, 15:13:22
to mechanica...@googlegroups.com
Well, this depends :).  If you look at messaging as a stream of operations (i.e. a stream of sends or receives), then it's probably better to optimize for the mean latency of all messages rather than only the first one (i.e. you want to amortize the cost/overhead of sending a message).  Whether you get noticeably worse latency for the 1st message clearly depends on how much extra work you have to do as part of batching.  Naively, I don't expect noticeable overhead in the common case, but clearly we can come up with scenarios where batching requires additional calculation, marshaling, or something else that takes time away from getting the 1st message out the door.

I'm not entirely clear on your 2nd paragraph -- what do you mean by "But in an HFT context, the end point of optimizing for latency doesn't even involve the CPU, let alone the JVM"?

Martin Thompson

Feb 18, 2015, 15:13:39
to mechanica...@googlegroups.com
On 18 February 2015 at 20:04, Jimmy Jia <tes...@gmail.com> wrote:
Martin/Vitaly:

Batching is something of a latency/throughput tradeoff, no? If you batch the send, you might get better latency for the 2nd message, but worse latency for the 1st message. It really then depends on your use case.

This is not the case if done well. I've always found I can rework algorithms to reduce latency by batching. If you batch based on a timeout then you can increase latency; however, if you send as soon as possible and naturally batch while sending, then latency is not practically impacted. The core design of Aeron is based on natural batching -- try it against any other implementation to see how it does on latency ;-)

Also I blogged on this subject ages ago.
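
To make that concrete, here is a minimal, hypothetical sketch of natural batching (not Aeron's actual implementation -- the queue, the length-prefix framing, and all names are invented for illustration). The sender never waits for a batch to fill; it drains whatever accumulated while the previous send was in flight, so bursts batch up naturally while an idle sender still gets the first message out immediately:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.DatagramChannel;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Sketch of natural batching: never wait for a batch to fill, just drain
    // everything that arrived while the previous send was in flight.
    final class NaturalBatchingSender implements Runnable {
        private final ConcurrentLinkedQueue<byte[]> pending = new ConcurrentLinkedQueue<>();
        private final DatagramChannel channel; // assumed already connected
        private final ByteBuffer sendBuffer = ByteBuffer.allocateDirect(64 * 1024);

        NaturalBatchingSender(DatagramChannel channel) { this.channel = channel; }

        void offer(byte[] message) { pending.add(message); }

        @Override public void run() {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    sendBuffer.clear();
                    byte[] msg;
                    // Under load this packs many messages into one send; when
                    // idle, the first message goes out with no added delay.
                    while ((msg = pending.peek()) != null
                            && sendBuffer.remaining() >= 4 + msg.length) {
                        pending.poll();
                        sendBuffer.putInt(msg.length); // naive length-prefix framing
                        sendBuffer.put(msg);
                    }
                    if (sendBuffer.position() > 0) {
                        sendBuffer.flip();
                        channel.write(sendBuffer); // one system call for the whole batch
                    }
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }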


But in an HFT context, the end point of optimizing for latency doesn't even involve the CPU, let alone the JVM. So to go back to ymo's question, forced to do for what? Even the kernel bypass shims aren't "free" to use from a developer time perspective, if you care about your code actually working.

OpenOnload can be loaded and the developer's code does not need to change at all.  I think I'm not understanding this last point -- can you elaborate?

ymo

Feb 18, 2015, 15:14:09
to mechanica...@googlegroups.com
I should rephrase: were you forced to bypass Java altogether to achieve better throughput (for batch operations) and latency (for OLTP-type requests)? I am trying to figure out what bypass tricks people use.



Jimmy Jia

Feb 18, 2015, 15:23:59
to mechanica...@googlegroups.com
Martin:

That makes sense regarding the batching. That seems like the sensible approach.

Regarding bypass shims, I'm more familiar with VMA than with OpenOnload, and while you do just use LD_PRELOAD, with VMA6 TCP bypass, there's a good chance that your code is not actually going to work, and that you'll have to invest a fair amount of time getting things configured correctly.

ymo:


You'll have to decide for yourself if these sorts of time savings are material in your context.

Vitaly:

I mean putting your entire fast path (market data in, order out) in FPGA.


Vitaly Davidovich

Feb 18, 2015, 15:29:16
to mechanica...@googlegroups.com
Well, using custom hardware (e.g. an FPGA) is an entirely separate matter -- if one is doing that, then their issue isn't Java vs C/C++.  Keep in mind that FPGA development turnaround time is much slower than for a commodity server.  If you're in the HFT space, quite likely the only things you can program them to do are things that don't change all that often (market data parsing is one of them, as you say).  For more fluid things (e.g. trading strategies), you're unlikely to go down the FPGA path.  But you can mix the two ...

Vitaly Davidovich

Feb 18, 2015, 15:33:55
to mechanica...@googlegroups.com
For pure Java overhead, you may want to look at the Netty codebase.  Besides JNI overhead, I think there are hot code paths that allocate in NIO -- IIRC, Netty wrote their own NIO replacement that's more frugal with allocations.  I think I've seen Norman Maurer on this list before; if he sees this thread he may be able to give you much more insight.


Jimmy Jia

Feb 18, 2015, 15:34:54
to mechanica...@googlegroups.com
The lowest latency trading strategies these days are typically implemented such that responding is done entirely in FPGA. Obviously there's software and CPUs involved, but your critical path never touches a CPU.

If you're not there (or past that, my knowledge is not fully up-to-date, which is why I'm talking about this at all), you're in the world of making latency/throughput/ease-of-development compromises.

This was more w/r/t Martin saying "API/protocols sooooooo need to go async and support batching if we are to take further steps in performance". This is not true in the lowest latency contexts, just because those sorts of things aren't on the critical path any more.

Vitaly Davidovich

Feb 18, 2015, 15:50:32
to mechanica...@googlegroups.com
I don't dispute that there are cases where the entire system is coded into the FPGA, but I think that's a minority of the segment ("HFT" is too broad, as there are layers within it), as this has practical implications for the system.  Sure, we can say that having some stuff on a CPU means we've traded off latency/throughput in favor of development time, but that's just a practical tradeoff.  It's analogous to saying "I'm going to build zero introspection/metrics/alerting/logging into my trading system because it impacts performance" -- sure, you can do that and gain perf, but good luck operating that system reliably and staying agile over a sustained period of time.

Jimmy Jia

Feb 18, 2015, 16:09:40
to mechanica...@googlegroups.com
Yeah. This is really a definition game. You can say "UHFT" and "normal HFT". Meanwhile, I can say "HFT" and "people who are slow". There's no correct or incorrect here.

There's definitely tons of value to doing stuff faster in software. I guess it just matters what ymo means by "people in hft".

Martin Thompson

Feb 18, 2015, 17:00:41
to mechanica...@googlegroups.com
For latency arb you're looking at FPGAs or ASICs in switches. However, there are lots of trading strategies other than latency arb. There is also lots of software, and some hybrid.

There is also a significant difference between the need to be fastest and being fast enough within the time-to-market for a strategy's shelf life. For many trading strategies it is sufficient to be fast enough that you don't get picked off or miss an opportunity.

ymo

Feb 18, 2015, 17:26:47
to mechanica...@googlegroups.com
I am surprised that no one has yet mentioned lock-free queues between C/C++ and Java to bypass NIO and garbage collection altogether -- meaning a Java thread is the consumer and a C++ thread is the producer (or vice versa). I was thinking this would be very prevalent by now.

Is anyone using this?


Jimmy Jia

Feb 18, 2015, 17:55:29
to mechanica...@googlegroups.com
Martin: Not just latency arb, but I agree with your point. There's no correct way to build a generic HFT system; it's just a question of what's useful for your trade. Ultimately it's only about being fast enough for your strategy.

ymo: There are cases where this architecture could be useful, but if you're just putting messages on a queue from Java for C++ code to consume, why not just use JNI?


ymo

Feb 18, 2015, 18:04:03
to mechanica...@googlegroups.com
If all your messages have "known" lengths, you can bypass the JNI overhead and garbage collection, since you are using off-heap memory here. It also gives you more control over memory layout and access.

But I have not seen this used widely so far.

Richard Warburton

Feb 18, 2015, 18:07:05
to mechanica...@googlegroups.com
Hi,

I am surprised that no one has yet mentioned lock-free queues between C/C++ and Java to bypass NIO and garbage collection altogether -- meaning a Java thread is the consumer and a C++ thread is the producer (or vice versa). I was thinking this would be very prevalent by now.

Is anyone using this?

Aeron's log buffer data structures act in this way: different processes, and it doesn't matter which language the client or the daemon is implemented in. At the moment only the Java ports are complete.

regards,

  Richard Warburton

Jimmy Jia

Feb 18, 2015, 18:10:19
to mechanica...@googlegroups.com
I don't understand. You have to serialize whatever you want to put into the queue into some sort of buffer, at which point you have full control over how things look in the buffer. You can do this just as well for handing things off to JNI code. Until you hand the serialized message off, you're fully vulnerable to GC no matter what. As long as you're not interacting back with anything on-heap from your JNI code, you're not going to be affected by GC there.

Granted, there are async things you can do with a separate process, but that seems like the main benefit here, if you need it - not latency in the general case of making something that looks like a blocking call.


Vitaly Davidovich

Feb 18, 2015, 18:12:13
to mechanica...@googlegroups.com
This *is* used, typically using DirectByteBuffer (but one can roll their own buffer).  However, there's still JNI involved (although you could batch this to amortize the cost) as the native and managed code need to communicate the base address of the exchanged data blob(s).


Jimmy Jia

Feb 18, 2015, 18:15:40
to mechanica...@googlegroups.com
Yes - my assertion is that if you just want to go do some work in C++, the simplest way to go about doing this is to pass a DirectByteBuffer or something equivalent into C++ code via JNI, as opposed to using a queue, unless you explicitly need the C++ worker to work asynchronously.
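
As a rough, hypothetical sketch of that hand-off (all names here are invented; the C++ side is not shown, but it would call JNI's GetDirectBufferAddress once per call to read the serialized bytes without copying):

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    // Hypothetical Java side of a DirectByteBuffer hand-off to C++ via JNI.
    final class NativeWorker {
        static { System.loadLibrary("nativeworker"); } // invented library name

        // Implemented in C++; reads the serialized message via
        // GetDirectBufferAddress, so no copy and no on-heap access natively.
        private static native int process(ByteBuffer buffer, int length);

        // One direct buffer allocated up front: no allocation on the hot path.
        private final ByteBuffer scratch =
                ByteBuffer.allocateDirect(4096).order(ByteOrder.nativeOrder());

        int sendOrder(long instrumentId, long price, int quantity) {
            scratch.clear();
            scratch.putLong(instrumentId).putLong(price).putInt(quantity);
            return process(scratch, scratch.position()); // synchronous call into C++
        }
    }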

Vitaly Davidovich

Feb 18, 2015, 18:24:02
to mechanica...@googlegroups.com
Yup, I was simply telling ymo that JNI is always there; in fact, there are very few calls into native land in Java that do not go via JNI (this is basically restricted to JIT intrinsics that make libc calls without JNI involvement, such as System.currentTimeMillis).

I also agree that there's typically no sharing of a queue; there may be queues on either the native and/or Java side, but that's an implementation detail that's not shared -- just the messages/data on them.

ymo

Feb 18, 2015, 18:24:12
to mechanica...@googlegroups.com
Richard... this is why I am following Aeron (very) closely )))

ymo

Feb 18, 2015, 18:27:48
to mechanica...@googlegroups.com
I am referring here to a circular buffer with a head and tail, exactly the way Aeron does it, but between C++ and Java. There is no JNI involved here at all. But you have to use Unsafe, unfortunately; if you want to stay away from Unsafe, this is not your cup of tea.

Vitaly Davidovich

Feb 18, 2015, 18:41:58
to mechanica...@googlegroups.com
What do you mean by "no JNI involved"? I'm assuming there's at least a JNI call to share the base address of the queue with the native code.  If you're talking about no JNI on queue operations, then OK -- you can certainly do that, provided you have a communication protocol between the two sides.  The downside of this approach is that your data is basically flattened into an array on both sides; that's not necessarily a natural approach in C or C++ (i.e. you may want to model your data via structs), and likely not the layout you'd have if the C code is also used by other C (or non-Java) programs.


Michael Barker

Feb 18, 2015, 18:49:23
to mechanica...@googlegroups.com
If two processes mmap the same file, then their virtual addresses will map to the same physical memory, so base addresses can be shared that way; no JNI call into native code is required.
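
A rough, hypothetical sketch of the Java consumer side, assuming a C++ producer has mapped the same file (the layout, the offsets, and the framing are invented, and wrap-around handling is omitted):

    import java.io.RandomAccessFile;
    import java.lang.reflect.Field;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import sun.misc.Unsafe;

    // Hypothetical Java consumer side of an SPSC ring shared with a C++
    // producer through a mmap'd file. Invented layout, for illustration only:
    //   bytes [0..7]  producer position
    //   bytes [8..15] consumer position
    //   bytes [16..]  length-prefixed message slots (wrap handling omitted)
    final class SharedRingConsumer {
        private static final Unsafe UNSAFE = loadUnsafe();
        private static final int CAPACITY = 1 << 20; // power of two

        private final long base; // address of the mapped region

        SharedRingConsumer(String path) throws Exception {
            FileChannel ch = new RandomAccessFile(path, "rw").getChannel();
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, 16 + CAPACITY);
            Field address = java.nio.Buffer.class.getDeclaredField("address");
            address.setAccessible(true);
            base = (long) address.get(map);
        }

        // Polls one message; returns null if the ring is currently empty.
        byte[] poll() {
            long consumerPos = UNSAFE.getLong(base + 8);
            long producerPos = UNSAFE.getLongVolatile(null, base); // acquire C++ writes
            if (consumerPos == producerPos) {
                return null;
            }
            long slot = base + 16 + (consumerPos & (CAPACITY - 1));
            int length = UNSAFE.getInt(slot);
            byte[] out = new byte[length];
            UNSAFE.copyMemory(null, slot + 4, out, Unsafe.ARRAY_BYTE_BASE_OFFSET, length);
            UNSAFE.putOrderedLong(null, base + 8, consumerPos + 4 + length); // release
            return out;
        }

        private static Unsafe loadUnsafe() {
            try {
                Field f = Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                return (Unsafe) f.get(null);
            } catch (ReflectiveOperationException e) {
                throw new AssertionError(e);
            }
        }
    }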

Mike

Vitaly Davidovich

Feb 18, 2015, 18:52:17
to mechanica...@googlegroups.com
Yes, but I didn't think we were talking about mmap'd files (i.e. IPC)? If we're talking about a shared queue between different *processes*, then yes, agreed.

Rishi

Feb 18, 2015, 21:02:54
to mechanica...@googlegroups.com
Would an NPU like Tilera provide an ideal trade-off between development time and latency? You can bypass kernel latency and PCIe latency.

Martin Thompson

Feb 19, 2015, 07:31:34
to mechanica...@googlegroups.com
To avoid allocation and some of the expensive costs of NIO, we had to perform some unnatural acts for Aeron. Some of this was inspired by Netty ;-)


To reduce latency and remove allocations we replaced the selector's internal collections via reflection. Aeron is allocation-free in steady state. After some research we found that the selector is fine for higher numbers of sockets, but simple polling directly on the channel works better for low numbers of sockets. The referenced TransportPoller class encapsulates this.
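
For reference, the reflection trick looks roughly like this. This is a sketch, not a supported API: the field names are internal details of sun.nio.ch.SelectorImpl on the JDK 7/8 runtimes of the time, so treat it as best-effort. The idea is to swap the selector's internal HashSet for a flat array-backed set so that selection and iteration do not allocate:

    import java.lang.reflect.Field;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.util.AbstractSet;
    import java.util.Arrays;
    import java.util.Iterator;

    // Sketch of the Netty/Aeron-style selected-key set replacement.
    final class SelectedKeySet extends AbstractSet<SelectionKey> {
        private SelectionKey[] keys = new SelectionKey[16];
        private int size;

        @Override public boolean add(SelectionKey key) {
            if (size == keys.length) {
                keys = Arrays.copyOf(keys, size * 2);
            }
            keys[size++] = key; // no HashSet node allocation here
            return true;
        }

        @Override public int size() { return size; }

        SelectionKey[] keys() { return keys; } // iterate this array directly

        void reset() { Arrays.fill(keys, 0, size, null); size = 0; }

        @Override public Iterator<SelectionKey> iterator() {
            throw new UnsupportedOperationException("iterate keys() instead");
        }

        // Install into the selector's private fields; the field names are JDK
        // implementation details (sun.nio.ch.SelectorImpl in JDK 7/8).
        static SelectedKeySet installInto(Selector selector) throws Exception {
            SelectedKeySet set = new SelectedKeySet();
            Class<?> impl = Class.forName("sun.nio.ch.SelectorImpl");
            Field selectedKeys = impl.getDeclaredField("selectedKeys");
            Field publicSelectedKeys = impl.getDeclaredField("publicSelectedKeys");
            selectedKeys.setAccessible(true);
            publicSelectedKeys.setAccessible(true);
            selectedKeys.set(selector, set);
            publicSelectedKeys.set(selector, set);
            return set;
        }
    }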

ymo

Feb 19, 2015, 07:38:21
to mechanica...@googlegroups.com
Shared queues (via mmap or even shmem) can be between two different processes, or between threads/cores within the same process. As far as the CPU is concerned it is just a memory location, as long as it is properly aligned. My point is that I have not seen actual code doing this between Java and C++ (so far), let alone benchmarks of it against JNI.
