Systematic Process to Reduce Linux OS Jitter


Jean Dagenais

Mar 30, 2015, 7:53:15 AM3/30/15
to mechanica...@googlegroups.com
Hello,

I am working on a project to investigate replacements for our current servers:
  • Current servers: 4-socket, E5-46xx based, with significant NUMA impact
  • Proposed new servers: 2-socket, E5-26xx v3, as recommended in one of the threads here
We are building low-latency trading systems, using Red Hat 6, SolarFlare, Java off-heap memory, etc.

With the current platform, we have a lot of OS jitter (which was confirmed by running jHiccup, MicroJitterSampler, sysjitter (SolarFlare), and perf sched. Thanks for these great tools!).

I have been trying to find information about how to systematically identify the sources of jitter, and remove them one by one.

For example, there is the Red Hat 7 "network-latency" tuned profile, which from this video seems to reduce it significantly.

Here are a few questions:
  1. Have you used the Red Hat "network-latency" profile with success?
    1. Are you using the tuned profile infrastructure to build your own "low-latency-trading-app" profile?
  2. Do you have a list of changes you always make to the HW/OS for low-latency trading apps?
    1. I am aware of the BIOS recommendations for low latency/high performance, and this seems to be relatively clear.
      1. Have any of you played with the prefetcher options? Are they always best for Java applications?
    2. For the Linux OS, there seem to be significantly more choices, and I wonder what priority you use? (A rough sketch of these options follows this list.)
      1. Boot options like nohz=off, isolcpus, processor.max_cstate=1, intel_idle.max_cstate=0
      2. Run-time options like kernel.sched_min_granularity_ns, kernel.sched_migration_cost, vm.swappiness
      3. Use of cgroups (does it have a performance impact, since it controls process resource usage, e.g. memory and cpu?)
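
For reference, this is roughly the kind of thing I have in mind (the values below are illustrative placeholders only, not a recommendation, and exact names vary by kernel version):

    # Kernel boot line additions (grub), e.g. isolating cores 8-15 and capping C-states:
    #   nohz=off isolcpus=8-15 processor.max_cstate=1 intel_idle.max_cstate=0
    # Run-time settings via sysctl (persisted in /etc/sysctl.conf):
    sysctl -w kernel.sched_min_granularity_ns=10000000
    sysctl -w kernel.sched_migration_cost=5000000
    sysctl -w vm.swappiness=0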
Thanks for your time and advice!
Jean




Gary Mulder

Mar 30, 2015, 2:42:30 PM3/30/15
to mechanica...@googlegroups.com
On 30 March 2015 at 12:53, 'Jean Dagenais' via mechanical-sympathy <mechanica...@googlegroups.com> wrote:
...
For example, there is the Red Hat 7 "network-latency" tuned profile, which from this video seems to reduce it significantly.

That's a very good first pass. 
  1. Do you have a list of changes you always make to the HW/OS for Low Latency Trading Apps?
Without knowing what the RH tuned profile does, it's hard to give specific additional information, but I'd check your NIC and TCP/IP configuration settings very carefully, as the default settings are often optimised for throughput, not latency. Have a look at the discussion about TOE, for example:


It might make sense to capture time series data on interrupt generation to see if you can correlate any jitter with interrupt spikes as well.
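
For example, something crude like this could work (eth0 is a placeholder for your actual interface):

    # Snapshot per-CPU interrupt counts once a second; diff the snapshots later
    # to see whether jitter spikes line up with interrupt spikes.
    while true; do
        date +%s.%N >> interrupts.log
        cat /proc/interrupts >> interrupts.log
        sleep 1
    done
    # And check whether the NIC is coalescing interrupts / offloading work
    # (throughput-friendly defaults are often latency-hostile):
    ethtool -c eth0
    ethtool -k eth0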

You've likely already disabled hyper-threading and Intel Turbo Boost. The E5-1630 v3's Turbo speed is only 0.1GHz faster than the base speed, so it looks to be your best choice. Check out cpustat and cputrack for monitoring CPU counters, but be aware they can cause jitter themselves.

Regards,
Gary

Gil Tene

Apr 5, 2015, 11:24:59 PM4/5/15
to mechanica...@googlegroups.com
Finding the cause of hiccups/jitters in a Linux system is black magic. You often look at the spikes and imagine "what could be causing this".

Based on empirical evidence (across many tens of sites thus far) and note-comparing with others, I use a list of "usual suspects" that I blame whenever they are not set to my liking and system-level hiccups are detected. Getting these settings right from the start often saves a bunch of playing around (and no, there is no "priority" to this - you should set them all right before looking for more advice...). My current starting point for Linux systems that are interested in avoiding many-msec hiccup levels is:

1. Turn THP (Transparent Huge Pages) OFF.

2. Set vm.min_free_kbytes to AT LEAST 1GB (8GB on larger systems).

3. Set Swappiness to 0.

4. Set zone_reclaim_mode to 0.

The defaults for items 1-4 are "wrong" on linux, and each can (independently) cause many-msec hiccups to occur. I know because I've personally run into each of them actually doing that (i.e. caught them red-handed, e.g. with a kernel stack trace showing THP's hands in the cookie jar).
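
As a minimal sketch of applying items 1-4 (sysfs/sysctl names as on recent RHEL-era kernels; note that some RHEL 6 kernels expose THP under /sys/kernel/mm/redhat_transparent_hugepage instead):

    # 1. Transparent Huge Pages OFF (including the defrag side)
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag
    # 2. vm.min_free_kbytes to at least 1GB (8GB on larger systems)
    sysctl -w vm.min_free_kbytes=1048576
    # 3. Swappiness to 0
    sysctl -w vm.swappiness=0
    # 4. zone_reclaim_mode to 0
    sysctl -w vm.zone_reclaim_mode=0
    # Persist the sysctl values in /etc/sysctl.conf so they survive reboots.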

In addition, I usually recommend: 

5. Turn HT (Hyper-threading) ON. (double the vcore run queues --> umpteen times lower likelihood of waiting for a cpu).

While HT=Off is often recommended, it is a premature speed optimization that comes with a significantly increased likelihood of "jitter" (if you consider a multi-msec stall "jitter"). Turning hyper-threading off will never help reduce your system-caused hiccups (at least not at the >20usec level). Turning HT off *may* (or may not) help improve the linear speed of execution of a thread. But it comes at the cost of halving the number of run queues available to the OS, which dramatically increases the likelihood of scheduler-quantum-level hiccups occurring if more runnable threads than cores exist even for a few msec. That's why I usually like to keep HT on until you get the hiccups out of your system. Then you can experiment with turning it off to see if (a) the hiccups don't come back, and (b) some other metrics got better.

Using numactl, taskset, and isolcpus can all help individual threads with the jitter or hiccups they may experience (in addition to cache behavior, etc.). Same goes for irqbalance. They are all nice for advanced stuff, but I like to clean the system up first, and only assign cores to things after that. In addition, once you do assign cores and such, you should start measuring hiccups/jitter separately on the cores (or set of cores) you are interested in. E.g. if you lock your process to node 1 with numactl, make sure you run jHiccup (or whatever other tool you use) such that it is limited to that same node.

A last note about assigning cores: for hiccup/jitter control, keeping things away from the cores you use is (much) more important than assigning your workload to specific cores. It is critically more important to assign the cores for the rest of the system (e.g. init and everything that comes from it after boot) to stay away from your favorite workload than it is to carefully plot out which core will execute which thread of your workload. I often see systems in which people carefully place threads on cores, but those cores are shared with pretty much any not-specifically-assigned process in the system (yes, this seems silly, but it is also very common to find in the wild). My starting recommendation is often to begin by keeping your "whole system" running on one socket (e.g. node 0), and your latency-sensitive process on "the other socket" (e.g. node 1). Once you get that running nice and clean, and show that you don't have large hiccups you don't like and still need to get rid of, you can start choosing cores within the socket if you want. But you'll often find that you don't need to go that far...
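
As a sketch of that starting point (node numbers, jar and class names are placeholders, and the exact jHiccup invocation varies by version):

    # Latency-sensitive JVM bound to node 1 for both cpu and memory, with the
    # hiccup measurement (here jHiccup as a javaagent) inheriting the same binding:
    numactl --cpunodebind=1 --membind=1 \
        java -javaagent:jHiccup.jar -cp app.jar com.example.TradingApp
    # Everything else (the "whole system") would be confined to node 0 -- see the
    # taskset-init discussion later in this thread.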

isolcpus is somewhat (but not entirely) different. With isolcpus you get the benefit of knowing that nobody will "accidentally" share your core. But you also lose the scheduler's core-balancing nature for the threads that you assign to isolcpus cores. So while it is sometimes useful to choose specific threads to run on isolcpus cores, the rest of the process (e.g. "the rest of the JVM", along with any background workers you may have in your process) will still benefit from a low-hiccup system or node, and that benefit will in turn show up even on the critical isolcpus-assigned thread. E.g. in JVMs, even an isolcpus-assigned thread's worst hiccups will be dominated by things like time-to-safepoint across all JVM threads, so you are still susceptible to hiccups outside of the isolcpus-assigned threads and cores.

Gary Mulder

Apr 6, 2015, 1:43:19 PM4/6/15
to mechanica...@googlegroups.com
On 6 April 2015 at 04:24, Gil Tene <g...@azulsystems.com> wrote:
A last note about assigning cores: for hiccup/jitter control, keeping things away from the cores you use is (much) more important than assigning your workload to specific cores. ...

Cheers for the correction Gil. I naively assumed the scheduler would be smart enough to never pre-empt a hot thread if enough cores are sitting idle.

It sounds like Linux is in need of a real time scheduler if you're recommending hardware tweaks to make the scheduler's job easier when the core count should theoretically be sufficient. I found the following article on PREEMPT_RT, but it could be a bit dated:


In relation to separating the "system" and "process" across sockets, my understanding is that the Linux kernel boots off core 0 (on CPU 0). Does anyone know whether it continues to be bound to core 0 and whether this is significant for kernel I/O? I wonder what the NUMA implications are for copies from kernel space to user space if JVM threads are bound to CPU 1, as the memory latency hit for a non-local memory access is very high.

Regards,
Gary

Gil Tene

Apr 6, 2015, 2:29:59 PM4/6/15
to mechanica...@googlegroups.com


On Monday, April 6, 2015 at 10:43:19 AM UTC-7, Gary Mulder wrote:
On 6 April 2015 at 04:24, Gil Tene <g...@azulsystems.com> wrote:
...

Cheers for the correction Gil. I naively assumed the scheduler would be smart enough to never pre-empt a hot thread if enough cores are sitting idle.

It sounds like Linux is in need of a real time scheduler if you're recommending hardware tweaks to make the scheduler's job easier when the core count should theoretically be sufficient. I found the following article on PREEMPT_RT, but it could be a bit dated:


Unfortunately, while real-time scheduling *within* a core is a well understood problem, the notion of "real time scheduling across cores" is more of a pipe dream (at least on current OSs that use per-core run queues). 

Those of us who grew up in the presence of real time OSs, but in a single-core world, are used to thinking of real time OSs in terms of the simple "the highest priority runnable thread is always the one running" guarantee. That's where things like priority inversion and priority inheritance play nice roles. But this concept does not transfer well to multi-core systems, unless the OS is willing to make a guarantee along the lines of "the N highest priority threads are always the ones running" (where N is the number of cores in the system). This guarantee is "technically doable" with a carefully managed shared, global, monolithic run queue. But virtually all current OS kernels use per-core run queues, which leaves the "balancing" of work between cores to a combination of background re-balancers and idle cores "stealing" work. Whenever there are more runnable threads than cores, the only guarantee you have left is that the single highest priority runnable thread in the system will be running on some core in the system. But the next N-1 highest priority threads may not be running, because they were unlucky enough to share a core with the #1 thread, while lower priority threads are running on other cores. A re-balancing event would be needed for that to change, and that may happen only once every few hundreds of usec or msec (depending on kernel settings) when all cores are actually busy doing something (which is when you really feel this problem).
 
In relation to separating the "system" and "process" across sockets, my understanding is that the Linux kernel boots of core 0 (on CPU 0). Does anyone know whether it continues to be bound to core 0 and whether this is significant for kernel I/O?

While bootstrapping often occurs on core 0, the kernel is everywhere... on all cores. Kernel code gets executed when system calls enter it (on any core that makes them), or when interrupts occur (on whatever core handles that interrupt). There are also "kernel threads" that do background work, but these get scheduled like anything else. At the bottom level, irqbalance determines how interrupts get sent to cores for various devices. Most systems tend to be configured to spread those across the cores.
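
For example, to see or change that spreading (the banned-CPU mask, IRQ number, and file locations below are placeholders to check against your distro's irqbalance):

    # Keep irqbalance away from a set of cores via a cpu bitmask,
    # e.g. in /etc/sysconfig/irqbalance on RHEL:
    #   IRQBALANCE_BANNED_CPUS=0000ff00
    # Or stop it and steer an individual device IRQ by hand:
    service irqbalance stop
    echo 0f > /proc/irq/123/smp_affinity    # "123" and the "0f" mask are placeholders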
 
I wonder what the NUMA implications are for copies from kernel space to user space if JVM threads are bound to CPU 1 as the memory latency hit for a non-local memory access is very high.

On modern multi-core, shared-cache sockets, it is certainly beneficial for speed (not so much jitter) to have your user-level workload reside on the same socket that will handle the I/O processing related to interrupts and other crunching in the kernel (like the TCP stack) that deals with its I/O. This way the data the user process will be accessing via system calls (e.g. read/write) has a good chance of being L3-cache resident, and not requiring an additional cache miss. How well this works varies by what the kernel may do with this stuff (e.g. TCP checksum in HW or SW), and how the I/O subsystems do their DRAM access. E.g. in newer x86 sockets with DDIO coolness, a NIC or HBA could be set up to move data through the socket's cache, such that even data that the kernel never looks at (e.g. a HW-checksummed TCP or UDP buffer) will still be L3-cache resident when the user code ends up reading it.

So for cutting down I/O-related latency, knowing which socket is actually connected to your NIC, and placing your user process and irq handlers on that socket, can help. But this also comes with hiccup/jitter complications and considerations. Ideally, you'd place your non-latency-critical processes on "the other socket", but often this is not that easy. E.g. you may have your I/O run across multiple NICs and HBAs, and the PCI slots may be striped across two sockets, which will leave you with some harder choices to make.
 


Regards,
Gary

Gary Mulder

Apr 6, 2015, 3:56:08 PM4/6/15
to mechanica...@googlegroups.com
On 6 April 2015 at 19:29, Gil Tene <g...@azulsystems.com> wrote:
So for cutting down I/O-related latency, knowing which socket is actually connected to your NIC, and placing your user process and irq handlers on that socket, can help. ...

/proc/interrupts and lspci may help a lot here at a guess...
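
For instance (interface and device names are placeholders):

    # Which NUMA node is the NIC attached to? (-1 means the platform reports none)
    cat /sys/class/net/eth4/device/numa_node
    # PCI topology, to see which slot/root complex the card sits behind:
    lspci -tv
    # Which IRQs belong to the interface, and which cores are servicing them:
    grep eth4 /proc/interrupts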

Gary 

Jean Dagenais

Apr 6, 2015, 7:09:43 PM4/6/15
to mechanica...@googlegroups.com
Thanks Gary and Gil for your very valuable and insightful suggestions.

Gil, when you mention <<A last note about assigning cores: for hiccup/jitter control, keeping things away from the cores you use is (much) more important than assigning your workload to specific cores>>.

In our case, we have processes that run as root (a management agent), and others that run as non-privileged users.

How would you suggest keeping them on node 0, so that node 1 is used for the application processes/threads?

Should we use taskset, numactl, or cgroups on these processes?
Would isolating all the cores on node 1 (isolcpus) achieve the same result (i.e. none of these processes would run on node 1)?

Thanks,
Jean

Gil Tene

Apr 6, 2015, 11:22:46 PM4/6/15
to mechanica...@googlegroups.com
Yes, you could use cgroups and cpusets (e.g. move all processes at startup to a common cgroup that uses a cpuset on node 0), but that means using cgroups, which makes a slight change to how scheduling works (e.g. in CFS, time slicing is hierarchical). You may or may not want that...

The simplest, least change-how-other-things-work way to place all processes on a certain set of cores by default is to taskset init (pid 1) early in the startup process. This will then be inherited by all subsequently launched processes, since they will all be started by init in one way or another. Use an init script and set it (e.g. via update-rc.d) to use a very low priority number to make sure it executes before other stuff does. You can then use numactl to launch stuff onto the other node.
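
A minimal sketch of that approach (core lists, jar and class names are placeholders; adjust to your topology):

    # Early in boot, from a low-priority-number init script: confine init (pid 1),
    # and hence everything it subsequently spawns, to node 0's cores.
    taskset -pc 0-7 1
    # Later, explicitly launch the latency-sensitive process onto node 1:
    numactl --cpunodebind=1 --membind=1 java -cp app.jar com.example.TradingApp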

The issue with using isolcpus for this purpose is that the isolcpus cores do not participate in any form of cross-core load balancing. So while using isolcpus for an entire node will keep the various regular processes off of that node, it will also make scheduling of threads within processes assigned to those cores act very differently: basically, if you launch a many-thread process with a taskset that includes all of the isolcpus cores in node 1 (for example), this will "appear to work" (i.e. things will run), but all the process threads will end up sharing the first core in the set specified to the taskset, while the rest remain idle. So assuming you want to let the scheduler actually place your various threads across multiple cores in node 1, isolcpus'ing the node is not the way to go. isolcpus is very useful for manually placing (one by one) specific latency-critical threads on specific isolated and dedicated (preferably per-thread) cores. And you can still do that by picking some specific cores on (e.g.) node 1 and isolcpus'ing them. But you'll want to use the bulk of the node with the regular scheduler, for the various threads you don't want to be managing explicitly (e.g. the VM thread, GC threads, compiler threads, executor thread pools, etc.).
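
For the handful of threads you do place by hand, something along these lines (core numbers and TIDs are placeholders; Java thread TIDs can be matched up via jstack's nid field or /proc/<pid>/task):

    # Boot with a couple of dedicated cores isolated, e.g.:  isolcpus=10,11
    # Then pin one latency-critical thread per isolated core by its Linux TID:
    taskset -pc 10 4242    # 4242 = TID of critical thread A (placeholder)
    taskset -pc 11 4243    # 4243 = TID of critical thread B (placeholder)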
 

Jean-Philippe BEMPEL

Apr 7, 2015, 6:45:12 AM4/7/15
to mechanica...@googlegroups.com
Hello Jean,

The way we are setting up our environment is the following:

If we consider node 0 as admin and node 1 as critical (application), the steps are (sketched below):
  • set isolcpus with all cores from node 1 (usually the odd core numbers)
  • launch the JVM with numactl to bind it to node 1 (numactl --membind=1 --cpunodebind=1 java ...), which ensures local DRAM access for the critical threads for any allocation made by the JVM
  • with taskset or the sched_setaffinity call, move all admin threads to node 0, leaving only the critical threads on node 1, placing them depending on the application.
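
A rough sketch of those three steps (core numbering, jar and class names are placeholders; see Gil's caveats in the next reply):

    # 1. Boot with all of node 1's cores isolated (here assumed to be the odd cores):
    #      isolcpus=1,3,5,7,9,11,13,15
    # 2. Launch the JVM bound to node 1 for both cpu and memory:
    numactl --membind=1 --cpunodebind=1 java -cp app.jar com.example.TradingApp
    # 3. Once running, move the admin/background threads back to node 0 by TID,
    #    leaving only the critical threads pinned on node 1:
    taskset -pc 0,2,4,6 4242    # 4242 = TID of an admin thread (placeholder)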

Cheers
Jean-Philippe

Gil Tene

Apr 7, 2015, 9:05:35 AM4/7/15
to mechanica...@googlegroups.com


On Tuesday, April 7, 2015 at 12:45:12 PM UTC+2, Jean-Philippe BEMPEL wrote:
Hello Jean,

The way we are setting up our environment is the following:

If we consider node 0 as admin and node 1 as critical (application):
  • set isolcpus with all cores from node 1 (usually the odd core numbers)
  • launch the JVM with numactl to bind it to node 1 (numactl --membind=1 --cpunodebind=1 java ...), which ensures local DRAM access for the critical threads for any allocation made by the JVM
  • with taskset or the sched_setaffinity call, move all admin threads to node 0, leaving only the critical threads on node 1, placing them depending on the application.
While this (JP's isolcpus all cores on node 1 suggestion above) can certainly be made to work, there are two key caveats to keep in mind with it:

1. You would need to manually set the core of each of the threads you place on node 1. Otherwise, all of them will end up sharing the first core in the node's core list, with the rest of the cores remaining idle. This is because isolcpus takes the cores off the scheduler's list of things to balance...

2. If you move any of the "admin" threads (e.g. VM thread, GC threads, compiler threads, your non-critical java threads, etc.) to node 0, those threads remain susceptible to whatever hiccup levels node 0 has, so you still care (a lot) about node 0's hiccup levels. Since the JVM will sometimes bring all threads to a global (JVM-wide) safepoint (e.g. for GC, or deoptimization, or lock debiasing, or thread dumps, or deadlock detection, or one of the other 17 things that can cause a global safepoint to happen), the hiccup levels that affect the threads on node 0 will "leak" into hiccup effects seen by the critical threads on node 1. I.e. a critical thread on node 1 will quickly arrive at a safepoint, and can then spend the next tens-of-msec-or-more waiting for the rest of the JVM threads (including those on node 0, which can be delayed by hiccups there) to reach the safepoint.

The combination of these two mechanisms is why I usually prefer to place the entire JVM ("admin" threads included) on a single socket. It has the added benefit of cutting down on time-to-safepoint and common safepoint operations, since the running thread stacks tend to be L3-local, making the JVM's safepoint work quicker. And yes, this does mean that your L3 is going to be somewhat affected by admin thread activity, and that your critical threads' L1/L2 contents may be evicted as a side-effect if they are cold (LRU behavior can usually avoid evicting the hot cache contents of critical threads). That's a classic choice between hiccup levels: the huge ones that will happen due to safepoint operations and scheduling artifacts, and the tiny ones that happen due to cache sharing and cross-eviction. I generally prefer to deal with the larger (multi-msec and many-msec) hiccup levels first, and only think about the smaller ones (cache effects only amount to multi-usec effects) once you have the big ones out of the way, and that takes work. E.g. with Zing we can keep those bigger hiccups at "very low" levels even in the presence of GC activity, but "very low" still includes measured hiccups on the order of ~0.5-1 msec several times per hour even on very clean and hand-tuned Linux systems. The best I've seen so far was a worst experienced hiccup of ~400usec over multiple days.

An extra-credit alternative, which can keep the JVM's background work away from your critical workload's L3, would be to split up node 0, placing the "whole system" on only some of the cores there (e.g. 1/2 of the cores), while keeping a "clean half" of node 0 on which you demonstrate very low hiccup levels because nothing else is running there (not even irqs). You'd then taskset the JVM to use this "clean half" of node 0, and (once launched) you'd set affinity for your various Java threads to use node 1, potentially involving isolcpus for some: i.e. some (most) of the application's threads can have their affinity set to use scheduling and core-balancing across most of the node 1 cores [the non-isolcpus cores there] (using taskset, or something like Peter Lawrey's thread affinity library from within the process), while a handful of critical threads are assigned manually to separate (and dedicated) isolcpus cores on node 1 (using the same tools). This extra-credit setup has the benefit of reduced cache contamination and cross-eviction, but comes with the slight downside of slightly longer safepoint operations (VM and GC threads examine thread stacks across sockets, taking more L3 misses in the critical-path safepoint code). Both of these effects are in the multi-usec range, but I honestly don't know which of them outweighs the other, as it is very hard to measure effects that low, and even when you can measure them, it's hard to deduce the mechanisms that cause whatever effects are seen in the measurements...

Of course, if you have more than 2 nodes (as E5-46xx systems do, and [depending on BIOS config] the latest E5-26xx v3 sockets with 18 cores playing out as two nodes in a single socket do), this conversation expands in various ways, but the same principles apply (who is sharing and cross-evicting on which L3? who is sharing the cores at a scheduling level? what are the hiccup levels seen by various parts of the system? etc.).

Jean-Philippe BEMPEL

Apr 7, 2015, 10:18:05 AM4/7/15
to mechanica...@googlegroups.com
The thing is (like always) it depends.

If you are just trying to eliminate hiccups, what Uncle Gil says is wisdom and covers it fully in detail (thanks, by the way).

What we try to achieve on our side is to keep the latency of the maximum number of messages unaffected by other processing.
It works well for us because almost all of our data fits in the L3 cache. Any other threads (or processes) that execute on the same socket will inevitably evict our critical data from L3, leading to cache misses to DRAM, which is a disaster for us (usually doubling our latency, roughly 100us to 200us).
How you measure (coordinated omission or not!), and what your SLA is, can greatly affect what you need to focus on.
In our case we are measuring the 99th percentile, and at our volumes the majority of the safepoint hiccups fall into the last 1%, so they are not an issue. However, we need good numbers for 99% of our latencies, which we can only achieve with the setup described above.

Alexander Turner

Apr 8, 2015, 12:25:18 PM4/8/15
to mechanica...@googlegroups.com
Hi Gil,

You mention 'not even irqs'. So far I have not come across a way to eradicate soft irqs. These are a huge pita in some systems. Have you had any success getting them off latency-critical cores?

Thanks - AJ

Gil Tene

Apr 8, 2015, 1:40:26 PM4/8/15
to mechanica...@googlegroups.com
If you use isolcpus, assign each critical thread to a specific (dedicated) core, and avoid directing any HW interrupts to the cores involved, softirqs tend to not be a big problem on those cores: nothing tends to make them occur there, other than timers, and if your critical thread is causing lots of timer events to fire at the same time, that's its own problem.
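
A quick way to sanity-check that on the isolated cores (the core numbers to watch are whichever you isolated):

    # Per-CPU hardware interrupt counts: the columns for the isolated cores should
    # stay essentially flat, apart from the local timer (LOC) line.
    cat /proc/interrupts
    # Per-CPU softirq counts, refreshed so that any increments stand out:
    watch -d -n 1 cat /proc/softirqs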

If you don't use isolcpus, any thread on any schedulable core could be hit with being forced to do a bunch of softirq work (looking, from a scheduling perspective, as if it is doing its own stuff). This behavior varies significantly by kernel version across 2.6-3.x. For an interesting description of the saga of softirq behavior over various kernel versions, and in the context of real-time-patched kernels, see this nice long summary: https://lwn.net/Articles/520076/

-- Gil.

Jean Dagenais

Apr 19, 2015, 4:39:32 PM4/19/15
to mechanica...@googlegroups.com
Thanks Gary, Gil, Jean-Philippe, and Alexander for the great information provided in this thread. This surely clarifies many aspects of the "black magic" required to figure out the sources of jitter on Linux-based servers, as highlighted by Gil.

In the next few weeks I will be starting the evaluation of a new trading platform, and will surely apply all the wise advice!

In the meantime, I have created a performance toolkit that I am using to assess the performance of our current platform, and in case it can be useful to others, I will share the tools in another thread.

Thanks again!
Jean

NeT MonK

Feb 19, 2016, 10:20:08 AM2/19/16
to mechanical-sympathy
Hello, I would like to add that with the RT kernel it is possible to give softirqs a lower priority than user processes.

Depending on your software requirements, and combined with isolcpus and nohz, a basic while(true)-looping process does not suffer much from OS/kernel interference.
I guess the real dream is to have a full MS-DOS-style application with everything in userland, like in the old days of demo making :) I heard that game coders and demo makers have been working in HFT lately :)

Andriy Plokhotnyuk

Feb 20, 2016, 8:27:15 AM2/20/16
to mechanical-sympathy

Lex Barringer

Dec 26, 2016, 6:05:31 AM12/26/16
to mechanical-sympathy
I realize this post is a little late to the party but it's good for people looking at tweaking their hardware and software for high frequency binary options trading, including crypto-currency on the various exchanges. 

As a note to all people seeking to create ultra-low-latency systems, not just network components/accessories: the clock rate (clock speed) of the CPU, its multipliers vs. the memory multipliers, voltages, clock speeds of the modules, as well as the CAS timing (and other associated memory timing parameters) can have a huge impact on how fast your overall system performance is, let alone its actual reaction speed. One of the most important areas is the ratio of the multipliers in the system itself.

While many operations are handled by the NIC in hardware and dedicated FIFO buffers of said devices, it still is necessary to have a tuned system hardware wise to give you the best performance, with the lowest overhead.

Depending on whether you use an Intel Xeon or the newer 68xx/69xx series Intel i7 Extreme, you may get better trading performance using a consumer-grade (non-Xeon) processor. I recommend the Intel i7-6950X: it's comparable in latency, and it can handle memory speeds well in excess of 2400 MHz (2.4 GHz). The key here is to find memory and motherboards that can handle DDR4-3200 memory at CAS 14. If you can get faster memory with the same CAS and your motherboard supports it, then do so.

I've used the following configuration:

Asus E-99 WS workstation board
Intel i7-6950X CPU
64 GiB of CAS 14 @ 3200 MHz RAM
1 Intel P3608 4 TB drive (it's the faster, bigger brother to the consumer Intel 750 NVMe SSD in a PCIe slot)
1 NewWave Design & Verification V5022 Quad Channel 10 Gigabit Ethernet FPGA Server Card

The NIC on the motherboard is used for PXE netboot; once the computer is booted, it loads the trading software and then starts it. The V5022 is used for the actual trading because it is very high speed and ultra low latency. You can run these computers with or without heads (monitors plugged into the video ports). I can't emphasize strongly enough that the logs from each trading computer must not be stored in the computer's memory, nor on the NVMe or some other local disk on the trading computers. You want to make these computers as hardened as possible; you will need a dedicated computer to receive, store and display the messages from each machine in a hardened environment. Your competition may send in hackers to try to take down your network and computer systems. Don't make it easy for them by keeping local logs of activities occurring over your networks and on the computers themselves. The computer that gathers all the logs need not be top of the line; a quad core, or a dual core with two hyper-threads, is sufficient. You could get by with an Intel i3-6100 based system to save money, and 16 GiB of RAM is plenty for what this machine will be doing.

A note about memory size: if you're worrying about clock jitter and jitter from other sources, having a specific memory size can either work for or against you. Not many people know this, but the safe sizes for 64-bit computing memory are the following in GiB: 16, 64, 256. Anything else and you're going to be risking a lot of cache misses and misalignments of data, which carry a huge latency penalty both in the hardware and in the software. Many people are tempted to put the maximum amount of RAM in their system when they're doing trading, but you have to realize how the memory is actually accessed by the hardware and by the operating system, in this case different distributions of Linux. I see people using 6, 8, 12, 24, 36, 48, 72, 96, 128, 224, 512 GiB of RAM; strange sizes like these can give you problems because of the way the memory managers in Linux are designed. While, technically, yes, the kernel can handle very large memory systems and strange sizes like these, it's not a good practice for system designers and builders to get into. Something else to note: the more chips a RAM module has, the more likely you are to have clock jitter, which can lead to some not-too-nice effects that are very hard to track down from the software side.

You also need to keep your systems and network switches below 50 degrees centigrade (ideally 45 C), not only does it extend the life of your equipment, low latency and jitter are kept within acceptable limits. The warmer the items are, the more unpredictable they become.
 
If you want more processing power, I would suggest using the Intel Xeon Phi co-processor cards, which use a Linux kernel on each card to manage the software kernel (from an OpenCL or some other computing language to run on said cards). This requires additional software programming, debugging and profiling. It's not a plug and play solution, it can't be used as an automatic extension of the main CPU in the system. I shall not get into that in this post as that's a whole new can of worms to talk about. 

I hope this gives you some interesting ideas to investigate further. 

Gil Tene

Dec 26, 2016, 12:09:16 PM12/26/16
to mechanica...@googlegroups.com
One of the biggest reasons folks tend to stay away from the consumer CPUs in this space (like the i7-6950X you mentioned below) is the lack of ECC memory support. I really wish Intel provided ECC support in those chips, but they don't. And ECC is usually a must when driving hardware performance to the edge, especially in FinServ. The nightmare scenarios that happen when you aggressively choose your parts and push their performance to the edge (and even if you don't) with no ECC are very real. The soft-error correcting capability (ECC is usually SECDED) is crucial for avoiding actually-wrong computation results from occurring on a regular basis from simple things like cosmic ray effects on your DRAM, and with the many-GB capacities we have in those servers, going without a cosmic-ray-driven bit-flip in DRAM is unlikely.

To move from hand waving and to actual numbers for the notion that ECC is critical (and to hopefully scare the s*&t out of people running business stuff with no soft-error correcting hardware), this 2009 Google Study paper makes for a good read. It covers field data collected between 2006 and 2008. Fast forward to section 3.1 if you are looking for some per-machine summary numbers. The simple takeaway summary is this: Even with ECC support, you have a ~1% chance of your machine experiencing an Uncorrectable Error (UE) once per year. But the chance of a machine encountering a Correctable Error (CE) at least once per year is somewhere in the 12-50% range, and the machines that do (which can be as many as half) will see those errors hundreds of times per year (so once every day or two).

One liner Summary: without hardware ECC support, random bits are probably flipping in your system memory, undetected, on a daily basis.
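
If your platform does have ECC and the EDAC driver loaded, you can see how often it is quietly saving you (sysfs paths and the edac-util tool vary by kernel and distro, so treat this as a sketch):

    # Corrected (ce) and uncorrected (ue) error counts per memory controller:
    grep . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count
    # Or, where the edac-utils package is installed:
    edac-util -v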

I believe that the current ECC-capable chips that would come close to the i7-6950X you mentioned below are the E5-1680 v4 (for 1-socket setups, peaks at 4.0GHz) and either the E5-2687W v4 or E5-2697A v4 (peaking at 3.5 and 3.6GHz respectively, though you'd probably need to carefully avoid using one of the cores on the 2697A to get there). The E3 series (e.g. E3-1280 v5) usually has the latest cores first, but their core counts tend to be tiny (4 physical cores compared to 8-12 in the others listed above).

Greg Young

Dec 26, 2016, 12:30:47 PM12/26/16
to mechanica...@googlegroups.com
@Gil this should be a blog post.




--
Studying for the Turing test

Dan Eloff

Dec 26, 2016, 1:28:48 PM12/26/16
to mechanica...@googlegroups.com
>Not many people know this but the safe sizes for 64-bit computing memory are the following in GiB; 16, 64, 256. Anything else and you're going to be risking a lot of cache misses and misalignments of data which has a huge latency penalty both on the hardware and in the software.

Can someone explain how this works to me? As I understand it, caches work from the physical addresses and are indexed based on the lowest bits of the address. At the gigabyte level only bits above 30 would see an uneven distribution, and only then with odd memory sizes (e.g. 48gb.) I don't see how 32gb can be worse than 16gb or 64gb from a caching or alignment perspective. Obviously the more memory in active use by the system, the more chances of cache conflicts, but apart from that I don't understand it.

Gil Tene

Dec 28, 2016, 2:24:30 AM12/28/16
to mechanical-sympathy


On Monday, December 26, 2016 at 10:28:48 AM UTC-8, Daniel Eloff wrote:
>Not many people know this but the safe sizes for 64-bit computing memory are the following in GiB; 16, 64, 256. ...

Can someone explain how this works to me? As I understand it, caches work from the physical addresses and are indexed based on the lowest bits of the address. ...

Yeh, GB-level memory sizes have no impact on cache misalignment issues... As you note, cache lines are 64 bytes, and physical addresses are distributed across the system at page levels, so even when doing wider accesses (e.g. two adjacent cache lines for 128 bytes) you'd be aligned and hitting the same DIMM no matter what DIMM size mixes you end up using.

There are certainly some impacts that come from choosing memory sizes, but those mostly have to do with filling the various memory controller channels evenly (to make sure all that memory bandwidth is usable), and dealing with the depth (number of ranks) that each channel ends up driving (which can affect access speed).

In systems with 3 memory channels per socket (Intel 55xx and 56xx), the "natural" balanced sizes for 2-socket systems were actually multiples of 6 (e.g. 24GB, 48GB, 72GB, 144GB), and normal power-of-2 memory sizes (e.g. 64GB) would actually result in unbalanced memory controller loads. From the E5-26xx on, sockets have had 4 memory channels, moving the natural sizes back to multiples of 8. So yes, 64, 128, 256, 512, but also 96, 192, 384 and 768.

The number-of-ranks thing is more complicated, since a single DIMM could have differing numbers of ranks (1, 2, 4, 8). On most E5-26xx systems, you can drive up to 3 DIMMs per memory channel (for a total of 12 DIMMs per socket, 24 DIMMs per 2-socket system), and up to 8 ranks per channel. But at least in some of those systems, the frequency and latency of DRAM access may be worse when more DIMMs and/or ranks are populated in a channel. When looking for maximum DRAM performance, you typically only populate 8 (RDIMMs) or 16 (LRDIMMs) per system on current systems (e.g. see this config guide for how choices affect memory frequencies). Since the cheapest DIMMs to use at this time appear to be 32GB DIMMs (cheaper per GB than 16GB or smaller), this probably means 256GB or 512GB on newer systems.

NeT MonK

Jan 5, 2017, 4:49:15 PM1/5/17
to mechanical-sympathy
Thank you for this post. 

My question is: where do you use your setup? In market colocation?

Where I work we have a bunch of Ciara servers: http://www.ciaratech.com/category.php?id_cat1=3&id_cat2=61&lang=en
with Intel i7 CPUs, Kingston RAM, Asus motherboards, watercooled, and overclocked to 4.9GHz.

We are soon going to receive some blackcore servers to PoC: http://www.exactatech.com/hft/

The Ciara servers perform far better than HP Gen8 Xeon servers on our application.

Jean-Philippe BEMPEL

Jan 6, 2017, 2:19:08 AM1/6/17
to mechanical-sympathy
Hi,

Those servers seem nice, but there is a huge no-go for us and our workload: there is only one socket.
The current CPUs have an L3 cache that is shared among cores, so depending on your workload, other processes and threads can pollute this level of cache.
In our application this pollution causes a 2x or more increase in latency compared to a dedicated socket for critical threads (thread affinity + isolcpus/cpuset).

Cheers