Thanks for the reply.
Actually, we have already tried the method described in the provided link (https://github.com/aliireza/ddio-bench), but unfortunately it does not work in our setup. It changes the DDIO state by modifying the PERFCTRLSTS_0 register of the PCIe root port behind which the RDMA NIC sits, and that only works on 2nd Gen Xeon (Cascade Lake) processors, whereas our Xeon Gold 6330 is a 3rd Gen (Ice Lake) part.
Changing the number of LLC ways reserved for DDIO also does little to improve performance. The underlying reason is that with DDIO enabled, RDMA writes land in the LLC. The LLC evicts at cacheline (64 B) granularity, while Optane PM has an internal access granularity of 256 B, so in the worst case every randomly evicted 64 B line touches a distinct 256 B block and the media writes roughly 4x the payload. Reserving more cache ways for DDIO does not make the LLC evict "sequentially", so sadly this does not help.
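For concreteness, the worst-case factor follows directly from the ratio of the two granularities above:

$$\mathrm{WA}_{\max} = \frac{256\ \mathrm{B}\ \text{(PM internal block)}}{64\ \mathrm{B}\ \text{(cacheline)}} = 4$$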
The benchmarking details are:
- Server: Intel Xeon Gold 6330 CPU, Mellanox CX-6 200Gbps single port RDMA NIC, 4x128GB DCPMMs installed
- Server-side PM configured in devdax mode; we first mmap /dev/dax0.0, then expose the mapped region to the client via ibv_reg_mr (see the first sketch after this list)
- Client issues sequential RDMA writes (IBV_WR_RDMA_WRITE) to the server-side PM and measures write bandwidth across different write sizes (second sketch below)
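For reference, here is a minimal sketch of the server-side setup; the region size and the choice of the first RDMA device are illustrative placeholders, not our exact configuration:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <infiniband/verbs.h>

#define PM_SIZE (1UL << 30)  /* illustrative 1 GiB window, not our full region */

int main(void)
{
    /* devdax: the PM region is exposed as a character device */
    int fd = open("/dev/dax0.0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* devdax mappings must be MAP_SHARED; offset/length must respect
     * the namespace alignment (typically 2 MiB) */
    void *pm = mmap(NULL, PM_SIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
    if (pm == MAP_FAILED) { perror("mmap"); return 1; }

    /* Open the first RDMA device (error handling mostly elided) */
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register the mapped PM so the client can RDMA-write into it */
    struct ibv_mr *mr = ibv_reg_mr(pd, pm, PM_SIZE,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }

    printf("rkey=0x%x addr=%p\n", mr->rkey, pm);
    /* ... exchange rkey/addr with the client, run the benchmark ... */
    return 0;
}
```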
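And a minimal sketch of how the client posts each write; QP setup and the rkey/addr exchange are elided, and the helper name is ours:

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Post one RDMA write of `len` bytes; remote_addr is advanced
 * sequentially between calls in our benchmark loop. */
int post_rdma_write(struct ibv_qp *qp, void *buf, uint32_t lkey,
                    uint64_t raddr, uint32_t rkey, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,          /* the write size we sweep */
        .lkey   = lkey,
    };
    struct ibv_send_wr wr, *bad;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = raddr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad);
}
```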
Results: we observe a maximum write bandwidth of ~3.5 GB/s on the client side and serious write amplification on the server side.