Hello, I am working on a project to investigate replacements for our current servers:
- Current servers: 4 sockets, E5-46xx based, with significant NUMA impact
- Proposed new servers: 2 sockets, E5-26xx v3, as recommended earlier in this thread
We are building low-latency trading systems, using Red Hat 6, SolarFlare, Java off-heap memory, etc. With the current platform we see a lot of OS jitter, which was confirmed by running JHiccup, MicroJitterSampler, sysjitter (SolarFlare), and perf sched. Thanks for these great tools!

I have been trying to find information about how to systematically identify the sources of jitter and remove them one by one. For example, there is the Red Hat 7 "network-latency" tuned profile, which from this video seems to reduce it significantly.
- Do you have a list of changes you always make to the HW/OS for Low Latency Trading Apps?
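For anyone following along: switching to that profile on RHEL 7 is a one-liner via tuned-adm (needs root, and the tuned package installed; shown here as a sketch, not a full tuning recipe):

```shell
# Activate the low-latency network profile (busy-polling, performance
# governor, reduced power-state transitions, etc.)
tuned-adm profile network-latency

# Confirm which profile is now active
tuned-adm active
```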
A last note about assigning cores: for hiccup/jitter control, keeping things away from the cores you use is (much) more important than assigning your workload to specific cores. It is far more important to confine the rest of the system (e.g. init and everything that descends from it after boot) away from your favorite workload than it is to carefully plot out which core will execute which thread of your workload. I often see systems in which people carefully place threads on cores, but those cores are shared with pretty much any non-specifically-assigned process in the system (yes, this seems silly, but it is very common to find). My starting recommendation is often to keep your "whole system" running on one socket (e.g. node 0), and your latency-sensitive process on "the other socket" (e.g. node 1). Once you get that running nice and clean, you could start choosing cores within the socket if you want, but you'll often find that you don't need to...
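As a starting point for the socket split described above, it helps to inspect the machine's topology before choosing nodes (a read-only sketch; the exact output depends on your hardware):

```shell
# Show NUMA nodes, which CPUs belong to each, per-node memory,
# and the inter-node distance matrix
numactl --hardware

# Cross-check the node-to-CPU mapping
lscpu | grep -i numa
```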
On 6 April 2015 at 04:24, Gil Tene <g...@azulsystems.com> wrote:...

Cheers for the correction, Gil. I naively assumed the scheduler would be smart enough to never pre-empt a hot thread if enough cores are sitting idle. It sounds like Linux is in need of a real-time scheduler if you're recommending hardware tweaks to make the scheduler's job easier when the core count should theoretically be sufficient. I found the following article on PREEMPT_RT, but it could be a bit dated:
In relation to separating the "system" and "process" across sockets, my understanding is that the Linux kernel boots off core 0 (on CPU 0). Does anyone know whether it continues to be bound to core 0, and whether this is significant for kernel I/O?
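One way to see where kernel interrupt work is actually landing, and to steer it, is via /proc (a sketch; the IRQ number 42 is hypothetical and would come from your own /proc/interrupts, and the writes need root):

```shell
# Per-CPU interrupt counts: shows which cores are servicing which IRQs
cat /proc/interrupts

# Steer a given IRQ (here the hypothetical IRQ 42) to CPU 0 only.
# smp_affinity takes a hex CPU bitmask; 1 = CPU 0.
echo 1 > /proc/irq/42/smp_affinity

# Stop irqbalance so it does not redistribute the IRQs again
systemctl stop irqbalance
```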
I also wonder what the NUMA implications are for copies from kernel space to user space if JVM threads are bound to CPU 1, as the memory-latency hit for a non-local memory access is very high.
Regards,
Gary
So for cutting down I/O-related latency, knowing which socket is actually connected to your NIC, and placing your user process and IRQ handlers on that socket, can help. But this also comes with hiccup/jitter complications and considerations. Ideally, you'd place your non-latency-critical processes on "the other socket", but often this is not that easy. E.g. you may have your I/O run across multiple NICs and HBAs, and the PCI slots may be striped across two sockets, which will leave you with some harder choices to make.
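For finding which socket a given NIC hangs off, sysfs exposes the node directly (eth0 here is a placeholder for your actual interface name):

```shell
# NUMA node the NIC's PCI device is attached to (-1 means unknown/non-NUMA)
cat /sys/class/net/eth0/device/numa_node

# CPUs local to that device, as a cpulist
cat /sys/class/net/eth0/device/local_cpulist
```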
Hello Jean,

The way we are setting up our environment is the following. If we consider node 0 as admin and node 1 as critical (application):
- set isolcpus with all cores from node 1 (usually the odd-numbered cores)
- launch the JVM with numactl, binding it to node 1 (numactl --membind=1 --cpunodebind=1 java ...), which ensures local DRAM access for any allocation made by the JVM
- with taskset, or using the sched_setaffinity call, move all admin threads to node 0, leaving only the critical threads on node 1 and placing them depending on the application.
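Putting those three steps together, a minimal sketch (the core ranges 0-7 and 8-15, the PID 1234, and the jar name are assumptions for a hypothetical 2-socket box; adjust to your own topology):

```shell
# Step 1 is a kernel command-line change (grub), not a runtime command:
#   isolcpus=8-15 nohz_full=8-15
# This keeps the scheduler from placing ordinary tasks on node 1's cores.

# Step 2: launch the JVM bound to node 1 for both CPU and memory
numactl --membind=1 --cpunodebind=1 java -jar trading-app.jar

# Step 3: confine an already-running admin process (hypothetical PID 1234)
# to node 0's cores
taskset -pc 0-7 1234
```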
Thanks Gary, Gil, Jean-Philippe, and Alexander for the great information provided in this thread. This surely clarifies many aspects of the "black magic" required, as Gil highlighted, to figure out the sources of jitter on Linux-based servers.
Depending on your software requirements, and combined with isolcpus and nohz, basic busy-spinning (while-true) processes do not suffer much from OS/kernel interference.
I guess the real dream is to have a full MS-DOS application with everything in userland, like in the old days of demo making :) I heard that game coders and demo makers have been working in HFT lately :)
> Not many people know this but the safe sizes for 64-bit computing memory are the following in GiB: 16, 64, 256. Anything else and you're going to be risking a lot of cache misses and misalignments of data which has a huge latency penalty both on the hardware and in the software.

Can someone explain how this works to me? As I understand it, caches work from the physical addresses and are indexed based on the lowest bits of the address. At the gigabyte level, only bits above 30 would see an uneven distribution, and only then with odd memory sizes (e.g. 48 GB). I don't see how 32 GB can be worse than 16 GB or 64 GB from a caching or alignment perspective. Obviously the more memory in active use by the system, the more chances of cache conflicts, but apart from that I don't understand it.
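To make the "lowest bits of the address" point concrete, here is a quick back-of-the-envelope calculation; the 32 KiB / 8-way / 64-byte-line geometry is a typical L1d, assumed for illustration, not a claim about any specific part:

```shell
# How many sets does the cache have, and which address bits index them?
line_bytes=64
ways=8
cache_bytes=$((32 * 1024))

# sets = cache size / (line size * associativity)
sets=$(( cache_bytes / (line_bytes * ways) ))   # 32768 / 512 = 64 sets

# log2(64) = 6 offset bits, then log2(64 sets) = 6 index bits,
# so the set index lives in address bits 6..11 -- nowhere near bit 30.
echo "sets=$sets"   # prints: sets=64
```

Which supports the question being asked: the set-index bits of a typical L1/L2 sit far below bit 30, so total installed DRAM by itself shouldn't change how lines map to sets.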