DPDK with write combining

Marc Richards

<marc@talawah.net>
Dec 29, 2021, 9:15:42 PM
to seastar-dev
Hi all,

The ENA DPDK driver supports a feature called write combining[1] that is supposed to enable the use of the ENA hardware's low latency queue functionality to improve performance. Support was built into the igb_uio kernel module, but the vfio-pci module must be patched/rebuilt to include support. After a few false starts[2][3] I was able to rebuild the VFIO module successfully, but I am not seeing any changes in performance or metrics.
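
For reference, the igb_uio path (where write-combining support is already built in) looks roughly like this; the igb_uio.ko path and PCI address below are placeholders:

sudo modprobe uio
sudo insmod ./igb_uio.ko wc_activate=1             # wc_activate=1 turns on write combining in igb_uio
sudo dpdk-devbind.py --bind=igb_uio 0000:00:06.0   # bind the ENA device to igb_uio

With vfio-pci the equivalent is binding the device after loading the patched/rebuilt module, which is the part I can't confirm is actually taking effect.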

I was wondering if anyone here might be able to shed light on whether or not it is reasonable to expect any performance difference from this feature with a simple Seastar HTTPD workload (in a low-latency cluster placement group). I also opened an issue on the ENA driver repo[4] to see if there is a way to verify that the feature is working when using DPDK.

Avi Kivity

<avi@scylladb.com>
Jan 3, 2022, 12:45:37 PM
to Marc Richards, seastar-dev


On 30/12/2021 04.15, Marc Richards wrote:
Hi all,

The ENA DPDK driver supports a feature called write combining[1] that is supposed to enable the use of the ENA hardware's low latency queue functionality to improve performance. Support was built into the igb_uio kernel module, but the vfio-pci module must be patched/rebuilt to include support. After a few false starts[2][3] I was able to rebuild the VFIO module successfully, but I am not seeing any changes in performance or metrics.

I was wondering if anyone here might be able to shed light on whether or not it is reasonable to expect any performance difference from this feature with a simple Seastar HTTPD workload (in a low-latency cluster placement group). I also opened an issue on the ENA driver repo[4] to see if there is a way to verify that the feature is working when using DPDK.


I'm not surprised you're not seeing extra performance. From what I can tell it avoids a DMA access for the header, maybe 100-200 nanoseconds. This will only be measurable in ping-pong workloads that have very few packets in flight.


You may be able to measure it with wrk with one thread and one connection, but maybe not even that.
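
Something like this, with stock wrk options (the server address is just a placeholder):

wrk --latency -t 1 -c 1 -d 30s http://<server-ip>:8080/    # one thread, one connection: close to a ping-pong workload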


Marc Richards

<marc@talawah.net>
Jan 3, 2022, 3:05:51 PM
to Avi Kivity, seastar-dev
Thanks for the additional insight, Avi. It also appears that the patch wasn't working quite as expected with kernel 5.15[1]. Even though I updated the code so that the patch applied cleanly, it needed further updates to actually build the new module. I will keep your example workload in mind when I test it again.

Marc Richards

<marc@talawah.net>
Jan 6, 2022, 5:11:39 PM
to seastar-dev
On Monday, January 3, 2022 at 3:05:51 PM UTC-5 Marc Richards wrote:
Thanks for the additional insight, Avi. It also appears that the patch wasn't working quite as expected with kernel 5.15[1]. Even though I updated the code so that the patch applied cleanly, it needed further updates to actually build the new module. I will keep your example workload in mind when I test it again.



I tested the fix using the branch from the pull request, and as it turns out there is a significant improvement in throughput (41%) for the HTTPD workload, as well as a reduction in round-trip ping times.

Non-DPDK interface (sudo ping -q -U -i 0 -s 18 -w 5 172.31.10.66)
--------------------------------
110246 packets transmitted, 110246 received,
rtt min/avg/max/mdev = 0.039/0.045/0.162/0.004 ms


DPDK without VFIO patch (sudo ping -q -U -i 0 -s 18 -w 5 172.31.10.95)
--------------------------------
120592 packets transmitted, 120592 received
rtt min/avg/max/mdev = 0.035/0.041/0.449/0.005 ms


DPDK WITH VFIO patch (sudo ping -q -U -i 0 -s 18 -w 5 172.31.10.95)
--------------------------------
130087 packets transmitted, 130086 received
rtt min/avg/max/mdev = 0.033/0.038/0.184/0.004 ms
 

DPDK without VFIO patch (twrk --latency --pin-cpus "http://172.31.10.95:8080/" -t 16 -c 256 -D 1 -d 5)
--------------------------------
Running 5s test @ http://172.31.10.95:8080/
  16 threads and 256 connections
  Thread Stats   Avg     Stdev       Max       Min   +/- Stdev
    Latency   639.48us  179.66us    1.53ms   63.00us   70.61%
    Req/Sec    24.96k   302.30     25.84k    24.18k    66.84%
  Latency Distribution
  50.00%  636.00us
  90.00%    0.90ms
  99.00%    1.00ms
  99.99%    1.22ms
  1986090 requests in 5.00s, 259.49MB read
Requests/sec: 397213.00


DPDK WITH VFIO patch  (twrk --latency --pin-cpus "http://172.31.10.95:8080/" -t 16 -c 256 -D 1 -d 5)
--------------------------------
Running 5s test @ http://172.31.10.95:8080/
  16 threads and 256 connections
  Thread Stats   Avg     Stdev       Max       Min   +/- Stdev
    Latency   451.02us  107.81us    1.06ms   68.00us   78.97%
    Req/Sec    35.28k   393.39     36.49k    34.35k    67.47%
  Latency Distribution
  50.00%  465.00us
  90.00%  561.00us
  99.00%  650.00us
  99.99%    0.88ms
  2807560 requests in 5.00s, 366.82MB read
Requests/sec: 561505.04

The PR should be merged in the near future, but I will also inquire about getting the change into the upstream kernel.