DPDK with write combining

Marc Richards

<marc@talawah.net>
Dec 29, 2021, 9:15:42 PM
to seastar-dev
Hi all,

The ENA DPDK driver supports a feature called write combining[1] that is supposed to enable the use of the ENA hardware's low latency queue functionality to improve performance. Support was built into the igb_uio kernel module, but the vfio-pci module must be patched/rebuilt to include support. After a few false starts[2][3] I was able to rebuild the VFIO module successfully, but I am not seeing any changes in performance or metrics.
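
For reference, the igb_uio path (where write-combining support is already built in) looks roughly like this; the igb_uio.ko path and PCI address below are placeholders:

sudo modprobe uio
sudo insmod ./igb_uio.ko wc_activate=1             # wc_activate=1 turns on write combining in igb_uio
sudo dpdk-devbind.py --bind=igb_uio 0000:00:06.0   # bind the ENA device to igb_uio

With vfio-pci the equivalent is binding the device after loading the patched/rebuilt module, which is the part I can't confirm is actually taking effect.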

I was wondering if anyone here might be able to shed light on whether or not it is reasonable to expect any performance difference from this feature with a simple Seastar HTTPD workload (in a low-latency cluster placement group). I also opened an issue on the ENA driver repo[4] to see if there is a way to verify that the feature is working when using DPDK.

Avi Kivity

<avi@scylladb.com>
Jan 3, 2022, 12:45:37 PM
to Marc Richards, seastar-dev


On 30/12/2021 04.15, Marc Richards wrote:
Hi all,

The ENA DPDK driver supports a feature called write combining[1] that is supposed to enable the use of the ENA hardware's low latency queue functionality to improve performance. Support was built into the igb_uio kernel module, but the vfio-pci module must be patched/rebuilt to include support. After a few false starts[2][3] I was able to rebuild the VFIO module successfully, but I am not seeing any changes in performance or metrics.

I was wondering if anyone here might be able to shed light on whether or not it is reasonable to expect any performance difference from this feature with a simple Seastar HTTPD workload (in a low-latency cluster placement group). I also opened an issue on the ENA driver repo[4] to see if there is a way to verify that the feature is working when using DPDK.


I'm not surprised you're not seeing extra performance. From what I can tell it avoids a DMA access for the header, maybe 100-200 nanoseconds. This will only be measurable in ping-pong workloads that have very few packets in flight.


You may be able to measure it with wrk with one thread and one connection, but maybe not even that.
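
Something like this, with stock wrk options (the server address is just a placeholder):

wrk --latency -t 1 -c 1 -d 30s http://<server-ip>:8080/    # one thread, one connection: close to a ping-pong workload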


Marc Richards

<marc@talawah.net>
Jan 3, 2022, 3:05:51 PM
to Avi Kivity, seastar-dev
Thanks for the additional insight, Avi. It also appears that the patch wasn't working quite as expected with kernel 5.15[1]. Even though I updated the code so that the patch applied cleanly, it needed further updates to actually build the new module. I will keep your example workload in mind when I test it again.

Marc Richards

<marc@talawah.net>
Jan 6, 2022, 5:11:39 PM
to seastar-dev
On Monday, January 3, 2022 at 3:05:51 PM UTC-5 Marc Richards wrote:
Thanks for the additional insight, Avi. It also appears that the patch wasn't working quite as expected with kernel 5.15[1]. Even though I updated the code so that the patch applied cleanly, it needed further updates to actually build the new module. I will keep your example workload in mind when I test it again.



I tested the fix using the branch from the pull request, and as it turns out there is a significant improvement in throughput (41%) for the HTTPD workload, as well as a reduction in round-trip ping times.

Non-DPDK interface (sudo ping -q -U -i 0 -s 18 -w 5 172.31.10.66)
--------------------------------
110246 packets transmitted, 110246 received,
rtt min/avg/max/mdev = 0.039/0.045/0.162/0.004 ms


DPDK without VFIO patch (sudo ping -q -U -i 0 -s 18 -w 5 172.31.10.95)
--------------------------------
120592 packets transmitted, 120592 received
rtt min/avg/max/mdev = 0.035/0.041/0.449/0.005 ms


DPDK WITH VFIO patch (sudo ping -q -U -i 0 -s 18 -w 5 172.31.10.95)
--------------------------------
130087 packets transmitted, 130086 received
rtt min/avg/max/mdev = 0.033/0.038/0.184/0.004 ms
 

DPDK without VFIO patch (twrk --latency --pin-cpus "http://172.31.10.95:8080/" -t 16 -c 256 -D 1 -d 5)
--------------------------------
Running 5s test @ http://172.31.10.95:8080/
  16 threads and 256 connections
  Thread Stats   Avg     Stdev       Max       Min   +/- Stdev
    Latency   639.48us  179.66us    1.53ms   63.00us   70.61%
    Req/Sec    24.96k   302.30     25.84k    24.18k    66.84%
  Latency Distribution
  50.00%  636.00us
  90.00%    0.90ms
  99.00%    1.00ms
  99.99%    1.22ms
  1986090 requests in 5.00s, 259.49MB read
Requests/sec: 397213.00


DPDK WITH VFIO patch  (twrk --latency --pin-cpus "http://172.31.10.95:8080/" -t 16 -c 256 -D 1 -d 5)
--------------------------------
Running 5s test @ http://172.31.10.95:8080/
  16 threads and 256 connections
  Thread Stats   Avg     Stdev       Max       Min   +/- Stdev
    Latency   451.02us  107.81us    1.06ms   68.00us   78.97%
    Req/Sec    35.28k   393.39     36.49k    34.35k    67.47%
  Latency Distribution
  50.00%  465.00us
  90.00%  561.00us
  99.00%  650.00us
  99.99%    0.88ms
  2807560 requests in 5.00s, 366.82MB read
Requests/sec: 561505.04

The PR should be merged in the near future, but I will also inquire about getting the change into the upstream kernel.