Re: How about upgrading DPDK?


Avi Kivity

<avi@scylladb.com>
May 15, 2019, 3:55:47 AM
to Yibo Cai, ScyllaDB development, seastar-dev, Takuya ASADA

Copying the seastar list as it's really a seastar problem.

On 15/05/2019 05.25, Yibo Cai wrote:
I did some investigation about dpdk updates since 17.05. There are two major changes:

1. new memory subsystem

2. new offloads api
- and many other patches...

Moving to the new offload API does not look that hard.

Regarding the new memory subsystem, the current Seastar code does complex work, allocating memory and creating the mempool and mbufs manually. My understanding is that this is because old DPDK eats up all hugepages, so it is done manually to save memory.
If this is the reason, I guess it can be significantly simplified, as the latest DPDK allocates hugepages on demand.


There are two reasons for the existing solution:


1. Accounting. Seastar allocates all system memory and the Seastar allocator knows how much memory is free, and calls a reclaimer when memory is low or when it cannot satisfy an allocation. This strategy requires that it knows about all allocations. If DPDK can also allocate memory on demand, Seastar has to leave that memory unallocated. If it leaves too little memory, DPDK will allocate too much and the kernel will run out of memory. If it leaves too much memory, then the application will not be able to use it.

2. Zero-copy: the tcp stack works with temporary_buffers so we can pass the received packets directly to the application. I guess we can still do that by attaching custom deleters to the buffers that will call the DPDK APIs instead of ::free(). Custom deleters can be expensive (require an extra allocation), but maybe it's not too bad, and maybe we can improve that somehow.
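A minimal sketch of the custom-deleter idea from point 2, under stated assumptions: `fake_mbuf`, `fake_mbuf_free`, `mbuf_deleter`, `packet_view` and `wrap_received` are hypothetical names used only for illustration, and `std::unique_ptr` stands in for Seastar's temporary_buffer. In this sketch the deleter state is stored inside the handle itself; whether temporary_buffer deleters can avoid the extra allocation in the same way is not something this thread settles.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical stand-in for a DPDK mbuf that owns the packet bytes.
struct fake_mbuf {
    char data[64];
    int* free_count;  // incremented each time the mbuf is released
};

// Hypothetical stand-in for the DPDK call that returns an mbuf to its pool.
inline void fake_mbuf_free(fake_mbuf* m) {
    ++*m->free_count;
}

// Deleter that gives the buffer back to the (fake) pool instead of ::free().
struct mbuf_deleter {
    fake_mbuf* owner;
    void operator()(char*) const { fake_mbuf_free(owner); }
};

// The application sees a plain char buffer; releasing it releases the mbuf.
using packet_view = std::unique_ptr<char, mbuf_deleter>;

// Wrap a received mbuf without copying its payload.
inline packet_view wrap_received(fake_mbuf* m) {
    return packet_view(m->data, mbuf_deleter{m});
}
```

The application can hold a `packet_view` for as long as it needs the data; the buffer goes back to its pool only when the handle is destroyed.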


I see mempool operations are virtualized:


/** Structure defining mempool operations structure */
struct rte_mempool_ops {
        char name[RTE_MEMPOOL_OPS_NAMESIZE]; /**< Name of mempool ops struct. */
        rte_mempool_alloc_t alloc;       /**< Allocate private data. */
        rte_mempool_free_t free;         /**< Free the external pool. */
        rte_mempool_enqueue_t enqueue;   /**< Enqueue an object. */
        rte_mempool_dequeue_t dequeue;   /**< Dequeue an object. */
        rte_mempool_get_count get_count; /**< Get qty of available objs. */
        /**
         * Optional callback to calculate memory size required to
         * store specified number of objects.
         */
        rte_mempool_calc_mem_size_t calc_mem_size;
        /**
         * Optional callback to populate mempool objects using
         * provided memory chunk.
         */
        rte_mempool_populate_t populate;
        /**
         * Get mempool info
         */
        rte_mempool_get_info_t get_info;
        /**
         * Dequeue a number of contiguous object blocks.
         */
        rte_mempool_dequeue_contig_blocks_t dequeue_contig_blocks;
} __rte_cache_aligned;

And that there is a way to add memory to a pool:


int
rte_mempool_populate_virt(struct rte_mempool *mp, char *addr,
        size_t len, size_t pg_sz, rte_mempool_memchunk_free_cb_t *free_cb,
        void *opaque);

So maybe we can move Seastar memory into the pool.


On Wednesday, 8 May 2019 02:26:07 UTC+8, Avi Kivity wrote:

Certainly an update of dpdk would be welcome. It's likely not to be an easy task, because certain APIs we depend on were removed.


One thing I wanted was to start using iommu instead of hugepage pinning + /proc/$$/pagemap, which I think is part of the API change. So we may have to switch to iommu as part of the upgrade.


On 07/05/2019 18.45, Yibo Cai wrote:
Seastar is still using DPDK 17.05, which is more than 2 years old. The DPDK API has changed a lot, and the current Seastar code fails to build with the latest DPDK 19.02.
E.g., some xmem functions (rte_mempool_xmem_create, etc.) used in the current code have been dropped since DPDK 18.08.

I'm having trouble building and running Seastar+DPDK on Mellanox ConnectX NICs.
To run DPDK on Mellanox NICs, there are heavy dependencies among the OFED lib/driver (provided by Mellanox), distro version, firmware version, DPDK version and kernel version. The best way is to use the latest versions of all components, which are verified by both Mellanox and the DPDK community.

With DPDK 17.05, I have to downgrade OFED, the kernel and even the distro (ubuntu18.04 to 16.04), as there's no older OFED version for ubuntu18.04. I don't think that's the correct way, and it won't work on Arm, as only the latest OFED supports Arm.

So I'm thinking of upgrading Seastar to support DPDK 19.02. I have some Mellanox ConnectX-4 and ConnectX-5 and Intel 82599 NICs for testing.

I'd like to know whether this upgrade is welcome and whether there are any concerns. Thanks.
--
You received this message because you are subscribed to the Google Groups "ScyllaDB development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scylla...@googlegroups.com.
To post to this group, send email to scylla...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/scylladb-dev/6df4cb39-15b6-4245-a475-3bd6d31f589c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Yibo Cai

<yibo.cai@linaro.org>
May 23, 2019, 10:02:38 AM
to Avi Kivity, ScyllaDB development, seastar-dev, Takuya ASADA
With some terrible workarounds, I managed to run seastar with dpdk 19.02.
Following https://github.com/scylladb/seastar/wiki/HTTPD-benchmark, I
validated dpdk on intel 82599 and mellanox mt27700.

That said, I'm still struggling with some problems.

Seastar manages memory in two different ways.

In one approach, seastar manages all normal memory (and THP) but lets
dpdk use statically allocated huge pages; example code path at
https://github.com/scylladb/seastar/blob/master/src/net/dpdk.cc#L1177-L1186.
All the complexities (hugepages, vfio, mellanox IB lib) are handled
by dpdk itself. This method is ported and validated.

In the other approach, when passing "--hugepages /dev/hugepages --memory
xxx", seastar uses statically allocated hugepages as its memory
backend. Seastar allocates external dpdk memory from its hugepage
pool. Example code path at
https://github.com/scylladb/seastar/blob/master/src/net/dpdk.cc#L1150-L1175.
I've ported this code; it is still being debugged.

Avi mentioned "start using iommu instead of hugepage pinning +
/proc/$$/pagemap". I see seastar references /proc/$$/pagemap for va/pa
mapping; the pa is used to manipulate the dpdk mempool
(https://github.com/scylladb/seastar/blob/master/src/net/dpdk.cc#L1383).
Seastar doesn't support iova, while dpdk prefers vfio. I think this
is the gap, but I'm not quite sure how to fix it. Any comments will be
appreciated.

Avi Kivity

<avi@scylladb.com>
May 23, 2019, 10:43:28 AM
to Yibo Cai, ScyllaDB development, seastar-dev, Takuya ASADA

On 23/05/2019 17.02, Yibo Cai wrote:
> With some terrible workarounds, I managed to run seastar with dpdk 19.02.
> Following https://github.com/scylladb/seastar/wiki/HTTPD-benchmark, I
> validated dpdk on intel 82599 and mellanox mt27700.
>
> That said, I'm still struggling with some problems.
>
> Seastar manages memory in two different ways.
>
> In one approach, seastar manages all normal memory (and THP) but lets
> dpdk use statically allocated huge pages; example code path at
> https://github.com/scylladb/seastar/blob/master/src/net/dpdk.cc#L1177-L1186.
> All the complexities (hugepages, vfio, mellanox IB lib) are handled
> by dpdk itself. This method is ported and validated.


Is there a downside to only using this method (at least for now)? Will
it bounce memory around?


I'm willing to allow some degradation in functionality in return for
upgrading dpdk, but I'd like to hear from other users too.


>
> In the other approach, when passing "--hugepages /dev/hugepages --memory
> xxx", seastar uses statically allocated hugepages as its memory
> backend. Seastar allocates external dpdk memory from its hugepage
> pool. Example code path at
> https://github.com/scylladb/seastar/blob/master/src/net/dpdk.cc#L1150-L1175.
> I've ported this code; it is still being debugged.
>
> Avi mentioned "start using iommu instead of hugepage pinning +
> /proc/$$/pagemap". I see seastar references /proc/$$/pagemap for va/pa
> mapping; the pa is used to manipulate the dpdk mempool
> (https://github.com/scylladb/seastar/blob/master/src/net/dpdk.cc#L1383).
> Seastar doesn't support iova, while dpdk prefers vfio. I think this
> is the gap, but I'm not quite sure how to fix it. Any comments will be
> appreciated.


I don't have deep knowledge in this area, but isn't iova just the
virtual address? Then vfio translates the virtual address to a physical
address using the iommu.


If this works we can just get rid of pa mapping and use va everywhere.


From the vfio API:


struct vfio_iommu_type1_dma_map {
        __u32   argsz;
        __u32   flags;
#define VFIO_DMA_MAP_FLAG_READ (1 << 0)         /* readable from device */
#define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)        /* writable from device */
        __u64   vaddr;                          /* Process virtual address */
        __u64   iova;                           /* IO virtual address */
        __u64   size;                           /* Size of mapping (bytes) */
};


So vaddr and iova are separate, but does anything prevent us from using
1:1 mapping? Of course, we want this to be done by dpdk, not Seastar.
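
A minimal sketch of what a 1:1 mapping request could look like, assuming nothing beyond the fields quoted above. `dma_map_request` is a local mirror of the quoted struct so that the sketch compiles without <linux/vfio.h>, and `make_identity_map` is a hypothetical helper; neither is a dpdk or Seastar API. The sketch only fills in the request and does not issue the ioctl.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Local mirror of the fields of vfio_iommu_type1_dma_map quoted above.
// Real code would include <linux/vfio.h> and use the kernel definition.
struct dma_map_request {
    std::uint32_t argsz;
    std::uint32_t flags;
    std::uint64_t vaddr;  // process virtual address
    std::uint64_t iova;   // IO virtual address
    std::uint64_t size;   // size of mapping in bytes
};

// Mirrors of VFIO_DMA_MAP_FLAG_READ and VFIO_DMA_MAP_FLAG_WRITE.
constexpr std::uint32_t dma_flag_read = 1u << 0;
constexpr std::uint32_t dma_flag_write = 1u << 1;

// Hypothetical helper: describe a region whose iova equals its vaddr.
inline dma_map_request make_identity_map(const void* base, std::size_t len) {
    dma_map_request req{};
    req.argsz = sizeof(req);
    req.flags = dma_flag_read | dma_flag_write;
    req.vaddr = reinterpret_cast<std::uintptr_t>(base);
    req.iova = req.vaddr;  // 1:1 mapping, so no pa lookup is needed
    req.size = len;
    return req;
}
```

With iova equal to vaddr, the address written into a descriptor is simply the buffer pointer, which is what would let the pa mapping go away.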

Yibo Cai

<yibo.cai@linaro.org>
May 30, 2019, 7:05:14 AM
to Avi Kivity, ScyllaDB development, seastar-dev, Takuya ASADA
Continuing from my last mail: the second data path (external dpdk buffer,
zero copy) is also working now. Things are looking promising :)

Can we create a local branch in the scylladb dpdk repo to track the latest
19.05 dpdk release? It will help with upgrading dpdk in seastar.

For vfio+iommu support, seastar needs to set up iommu page tables to
map va to iova for external buffers not created by dpdk itself. From
the dpdk sample code, I see rte_malloc_heap_memory_add() does the trick.

I tried two approaches.

One approach is to map external buffers on demand. When the application
sends a packet, to achieve zero copy, the buffer address is written to
the dpdk mbuf (current code writes pa; new code writes iova, which equals
va), and the buffer range is mapped for dma by seastar (only in the new
code). The problem with this approach is io page table maintenance. As
external buffers may share pages, cross page boundaries, etc., we must
track all mapped pages to know how to correctly map new buffers. That
adds a big overhead in both space and time. I validated this approach as
a proof of concept, without page table maintenance.
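
To illustrate the bookkeeping the on-demand approach would need, here is a minimal sketch of per-page reference counting. `page_tracker` and the 4096-byte `page_size` are assumptions made for illustration; this is not code from seastar or dpdk, and the actual dma map/unmap calls are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

constexpr std::uint64_t page_size = 4096;  // assumed page size

// Tracks how many live buffers touch each page, so that a page is mapped
// when its first buffer arrives and unmapped when its last buffer goes.
class page_tracker {
    std::map<std::uint64_t, unsigned> _refs;  // page number -> buffer count
public:
    // Registers a buffer; returns how many pages need a new dma mapping.
    std::size_t add(std::uint64_t addr, std::uint64_t len) {
        if (len == 0) {
            return 0;
        }
        std::size_t newly_mapped = 0;
        const std::uint64_t last = (addr + len - 1) / page_size;
        for (std::uint64_t p = addr / page_size; p <= last; ++p) {
            if (_refs[p]++ == 0) {
                ++newly_mapped;
            }
        }
        return newly_mapped;
    }

    // Unregisters a buffer; returns how many pages can now be unmapped.
    std::size_t remove(std::uint64_t addr, std::uint64_t len) {
        if (len == 0) {
            return 0;
        }
        std::size_t unmapped = 0;
        const std::uint64_t last = (addr + len - 1) / page_size;
        for (std::uint64_t p = addr / page_size; p <= last; ++p) {
            auto it = _refs.find(p);
            if (it != _refs.end() && --it->second == 0) {
                _refs.erase(it);
                ++unmapped;
            }
        }
        return unmapped;
    }

    std::size_t mapped_pages() const { return _refs.size(); }
};
```

Every send and every buffer release has to walk this structure, which is the space and time overhead mentioned above.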

The second approach is much simpler. It maps the whole heap of each core at
dpdk initialization time. The heap area is obtained at
https://github.com/scylladb/seastar/blob/master/src/core/memory.cc#L1059-L1065.
We are okay as long as the external buffers always lie in
this mapped area. The only change to the current code is to drop the pa
walk and set iova = va. This approach is validated.
There are still some catches:
- Will the application pass in a buffer located outside of that core's heap?
A local buffer on the stack? A global buffer in the data/bss sections?
- Will some devices have a limited address space, which breaks the direct
va/iova mapping? (I see dpdk uses iova=va, so I guess it's not a big
problem)

I prefer the second approach. Would like to know if there's anything I
missed. Thanks.

Avi Kivity

<avi@scylladb.com>
May 31, 2019, 12:17:30 AM
to Yibo Cai, ScyllaDB development, seastar-dev, Takuya ASADA
On 5/30/19 2:05 PM, Yibo Cai wrote:
> Continuing from my last mail: the second data path (external dpdk buffer,
> zero copy) is also working now. Things are looking promising :)
>
> Can we create a local branch in the scylladb dpdk repo to track the latest
> 19.05 dpdk release? It will help with upgrading dpdk in seastar.


Pushed (local-19.05). But I hope we won't need local patches any more.


> For vfio+iommu support, seastar needs to set up iommu page tables to
> map va to iova for external buffers not created by dpdk itself. From
> the dpdk sample code, I see rte_malloc_heap_memory_add() does the trick.
>
> I tried two approaches.
>
> One approach is to map external buffers on demand. When the application
> sends a packet, to achieve zero copy, the buffer address is written to
> the dpdk mbuf (current code writes pa; new code writes iova, which equals
> va), and the buffer range is mapped for dma by seastar (only in the new
> code). The problem with this approach is io page table maintenance. As
> external buffers may share pages, cross page boundaries, etc., we must
> track all mapped pages to know how to correctly map new buffers. That
> adds a big overhead in both space and time. I validated this approach as
> a proof of concept, without page table maintenance.
>
> The second approach is much simpler. It maps the whole heap of each core at
> dpdk initialization time. The heap area is obtained at
> https://github.com/scylladb/seastar/blob/master/src/core/memory.cc#L1059-L1065.
> We are okay as long as the external buffers always lie in
> this mapped area. The only change to the current code is to drop the pa
> walk and set iova = va. This approach is validated.


This is the approach that I like, it is fast and simple.


> There are still some catches:
> - Will the application pass in a buffer located outside of that core's heap?
> A local buffer on the stack? A global buffer in the data/bss sections?


I think these aren't realistic problems. The stack is not saved across a
continuation boundary, so you can't use it for I/O. seastar::thread
stacks are allocated from the shard heaps, so they will be mapped.
Globals are also unlikely to be used for packet buffers.


It's possible for seastar::alien users to pass non-seastar memory to
seastar and then attempt to send it (with zero copy) via tcp. For that,
we can add a registration API so the application will tell Seastar about
this memory, and Seastar can register the memory with dpdk. I don't
think we need to do this now, we can wait for the first user. Meanwhile
we'll recommend using Seastar memory for TCP.


> - Will some devices have a limited address space, which breaks the direct
> va/iova mapping? (I see dpdk uses iova=va, so I guess it's not a big
> problem)
>
> I prefer the second approach. Would like to know if there's anything I
> missed. Thanks.


I think this is the right track.


Will it work with VFIO_NOIOMMU_IOMMU? It offers no translation, so I
think it will not. This means we won't be able to use it on some cloud
virtual machines. I think we can live with this restriction and require
that there be a working iommu on the system.

Yibo Cai

<yibo.cai@linaro.org>
Jun 2, 2019, 10:45:47 PM
to Avi Kivity, ScyllaDB development, seastar-dev, Takuya ASADA
On Fri, 31 May 2019 at 12:17, Avi Kivity <a...@scylladb.com> wrote:
>
> On 5/30/19 2:05 PM, Yibo Cai wrote:
> > Continue last mail, second data path(external dpdk buffer, zero copy)
> > is also working now. Things are looking promising :)
> >
> > Can we create a local branch in scylladb dpdk repo to track latest
> > 19.05 release dpdk code? It will help seastar dpdk upgrading.
>
>
> Pushed (local-19.05). But I hope we won't need local patches any more.
>

I checked all local patches to 17.05; three patches are not included
in upstream. I'm not sure whether they are still required. Could someone
have a look?
- https://github.com/scylladb/dpdk/commit/15ba57cef920d60fc3d9ed6f4f852f56328c484a
- https://github.com/scylladb/dpdk/commit/3222373463a1e2ff24fbee7be3c26ee57234e388
- https://github.com/scylladb/dpdk/commit/be07b20eab7dc10ac3c681d2784e95654e82f833

I find that the application doesn't catch any fault when accessing an
unmapped dma area; it just returns some garbage data (dmesg will show
errors like [DMA Write] ... PTE Write access is not set).
That is not good for debugging. Maybe we can check the address range
of the zero copy buffer and fall back to the copy method if it's not
within the heap, or simply bugcheck.
The current code does a similar thing when the pa walk fails:
https://github.com/scylladb/seastar/blob/master/src/net/dpdk.cc#L972-L974
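
The range check suggested above could look roughly like the following sketch. `mapped_region`, `tx_path`, `contains` and `choose_tx_path` are hypothetical names introduced for illustration and are not taken from dpdk.cc; the sketch assumes a single contiguous mapped heap per core.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical description of the per-core heap that was mapped for dma.
struct mapped_region {
    std::uintptr_t start;
    std::size_t size;
};

enum class tx_path { zero_copy, copy };

// True only if [buf, buf + len) lies entirely inside the mapped region.
// Written so that no addition can overflow.
inline bool contains(const mapped_region& r, const void* buf, std::size_t len) {
    const auto p = reinterpret_cast<std::uintptr_t>(buf);
    return p >= r.start && len <= r.size && p - r.start <= r.size - len;
}

// Use zero copy for buffers inside the mapped heap, otherwise fall back
// to copying the data into a dpdk-owned buffer.
inline tx_path choose_tx_path(const mapped_region& heap, const void* buf,
                              std::size_t len) {
    return contains(heap, buf, len) ? tx_path::zero_copy : tx_path::copy;
}
```

A stricter variant could abort instead of returning `tx_path::copy`, which corresponds to the bugcheck option.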

>
> > - Will some device has limited address space which breaks va/iova
> > directly mapping? (I see dpdk uses iova=va, so I guess it's not a big
> > problem)
> >
> > I prefer the second approach. Would like to know if there's anything I
> > missed. Thanks.
>
>
> I think this is the right track.
>
>
> Will it work with VFIO_NOIOMMU_IOMMU? It offers no translation, so I
> think it will not. This means we won't be able to use it on some cloud
> virtual machines. I think we can live with this restriction and require
> that there be a working iommu on the system.
>

I think it will only work on iommu-enabled systems, and the nic must
be bound to the vfio-pci driver.

There's another tricky thing (there's no end to them :)
Mellanox nics don't use vfio; they depend on infiniband libs to access
the hardware from user mode.
I was using rte_malloc_heap_memory_add() to map dma implicitly; it
hides the vfio and ib differences, and works fine on both mellanox and
82599. But I don't like this api, as it does much more than
necessary, and it's marked as experimental.
I found a much simpler api, rte_vfio_dma_map(), which does exactly what
we want, except that it doesn't support mellanox.
So I think we have to compromise: no zero copy support
for mellanox nics until a better approach is found.

Avi Kivity

<avi@scylladb.com>
Jun 4, 2019, 6:04:00 AM
to Yibo Cai, Vladislav Zolotarov, ScyllaDB development, seastar-dev, Takuya ASADA

On 03/06/2019 05.45, Yibo Cai wrote:
> On Fri, 31 May 2019 at 12:17, Avi Kivity <a...@scylladb.com> wrote:
>> On 5/30/19 2:05 PM, Yibo Cai wrote:
>>> Continue last mail, second data path(external dpdk buffer, zero copy)
>>> is also working now. Things are looking promising :)
>>>
>>> Can we create a local branch in scylladb dpdk repo to track latest
>>> 19.05 release dpdk code? It will help seastar dpdk upgrading.
>>
>> Pushed (local-19.05). But I hope we won't need local patches any more.
>>
> I checked all local patches to 17.05, three patches are not included
> in upstream. Not sure if they are still required. Would someone have a
> look?
> - https://github.com/scylladb/dpdk/commit/15ba57cef920d60fc3d9ed6f4f852f56328c484a
> - https://github.com/scylladb/dpdk/commit/3222373463a1e2ff24fbee7be3c26ee57234e388
> - https://github.com/scylladb/dpdk/commit/be07b20eab7dc10ac3c681d2784e95654e82f833


Vlad, can you please check?



Vladislav Zolotarov

<vladz@scylladb.com>
Jun 4, 2019, 1:20:47 PM
to Avi Kivity, Yibo Cai, ScyllaDB development, seastar-dev, Takuya ASADA


On 5/31/19 12:17 AM, Avi Kivity wrote:
> On 5/30/19 2:05 PM, Yibo Cai wrote:
>> Continue last mail, second data path(external dpdk buffer, zero copy)
>> is also working now. Things are looking promising :)
>>
>> Can we create a local branch in scylladb dpdk repo to track latest
>> 19.05 release dpdk code? It will help seastar dpdk upgrading.
>
>
> Pushed (local-19.05). But I hope we won't need local patches any more.

I think there were some ixgbe PMD related patches that were not merged. They were addressing some real issues.
We may still need these if they are not part of 19.05.

Vladislav Zolotarov

<vladz@scylladb.com>
Jun 4, 2019, 3:17:49 PM
to Avi Kivity, Yibo Cai, ScyllaDB development, seastar-dev, Takuya ASADA
All these are required.




