Copying the seastar list as it's really a seastar problem.
I did some investigation about dpdk updates since 17.05. There are two major changes:
1. new memory subsystem
2. new offloads api
- and many other patches...
Moving to the new offloads API does not look that hard.
Regarding the new memory subsystem, the current Seastar code does a lot of work by hand: it allocates memory and creates the mempool and mbufs manually. My understanding is that this is because old DPDK eats up all hugepages, so Seastar does it manually to save memory. If this is the reason, I guess it can be significantly simplified, as the latest DPDK allocates hugepages on demand.
There are two reasons for the existing solution:
1. Accounting. Seastar allocates all system memory, and the Seastar allocator knows how much memory is free. It calls a reclaimer when memory is low or when it cannot satisfy an allocation. This strategy requires that it knows about all allocations. If DPDK can also allocate memory on demand, Seastar has to leave that memory unallocated. If it leaves too little memory, DPDK will allocate too much and the kernel will run out of memory. If it leaves too much memory, then the application will not be able to use it.
2. Zero-copy: the tcp stack works with temporary_buffers so we can pass the received packets directly to the application. I guess we can still do that by attaching custom deleters to the buffers that will call the DPDK APIs instead of ::free(). Custom deleters can be expensive (require an extra allocation), but maybe it's not too bad, and maybe we can improve that somehow.
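To make the custom-deleter idea in point 2 concrete, here is a minimal C++ sketch. It is only an illustration under assumptions: fake_mbuf, fake_pool, zero_copy_buf and wrap are hypothetical stand-ins, not DPDK or Seastar types, and in real code the deleter would call a DPDK routine such as rte_pktmbuf_free() instead of fake_pool::put().

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a DPDK mbuf; NOT the real rte_mbuf.
struct fake_mbuf {
    char data[2048];
    bool in_use = false;
};

// Hypothetical stand-in for a mempool that counts its free buffers.
class fake_pool {
    std::vector<fake_mbuf> _bufs;
    std::size_t _free;
public:
    explicit fake_pool(std::size_t n) : _bufs(n), _free(n) {}
    fake_mbuf* get() {
        for (auto& b : _bufs) {
            if (!b.in_use) {
                b.in_use = true;
                --_free;
                return &b;
            }
        }
        return nullptr;  // pool exhausted
    }
    void put(fake_mbuf* b) {
        b->in_use = false;
        ++_free;
    }
    std::size_t free_count() const { return _free; }
};

// The application sees a plain char* to the packet bytes; releasing the
// buffer is delegated to a deleter that carries the pool and the mbuf.
using zero_copy_buf = std::unique_ptr<char, std::function<void(char*)>>;

zero_copy_buf wrap(fake_pool& pool, fake_mbuf* m) {
    // The deleter ignores the data pointer and hands the whole mbuf back
    // to its pool, where ordinary heap memory would be passed to ::free().
    std::function<void(char*)> del = [&pool, m](char*) { pool.put(m); };
    return zero_copy_buf(m->data, std::move(del));
}

// Returns the pool's free count after one receive-and-release cycle.
std::size_t demo_zero_copy() {
    fake_pool pool(4);
    fake_mbuf* m = pool.get();      // "receive" a packet into an mbuf
    assert(pool.free_count() == 3);
    {
        zero_copy_buf buf = wrap(pool, m);
        buf.get()[0] = 'x';         // application touches the bytes in place
    }                               // buf destroyed: mbuf goes back to the pool
    return pool.free_count();
}
```

Calling demo_zero_copy() hands the packet bytes to the application without a copy and returns the mbuf to the pool when the buffer goes out of scope. The deleter has to carry per-buffer state (pool and mbuf), which is where the extra cost mentioned above comes from.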
I see mempool operations are virtualized:
/** Structure defining mempool operations structure */
struct rte_mempool_ops {
    char name[RTE_MEMPOOL_OPS_NAMESIZE]; /**< Name of mempool ops struct. */
    rte_mempool_alloc_t alloc;           /**< Allocate private data. */
    rte_mempool_free_t free;             /**< Free the external pool. */
    rte_mempool_enqueue_t enqueue;       /**< Enqueue an object. */
    rte_mempool_dequeue_t dequeue;       /**< Dequeue an object. */
    rte_mempool_get_count get_count;     /**< Get qty of available objs. */
    /**
     * Optional callback to calculate memory size required to
     * store specified number of objects.
     */
    rte_mempool_calc_mem_size_t calc_mem_size;
    /**
     * Optional callback to populate mempool objects using
     * provided memory chunk.
     */
    rte_mempool_populate_t populate;
    /**
     * Get mempool info
     */
    rte_mempool_get_info_t get_info;
    /**
     * Dequeue a number of contiguous object blocks.
     */
    rte_mempool_dequeue_contig_blocks_t dequeue_contig_blocks;
} __rte_cache_aligned;
And that there is a way to add memory to a pool:
int
rte_mempool_populate_virt(struct rte_mempool *mp, char *addr,
        size_t len, size_t pg_sz,
        rte_mempool_memchunk_free_cb_t *free_cb, void *opaque);
So maybe we can move Seastar memory into the pool.
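A rough, self-contained C++ model of what "moving Seastar memory into the pool" could look like: the caller owns a memory chunk, hands it to a pool that carves it into fixed-size objects, and gets a callback when the pool no longer needs the chunk. Everything here (external_pool, chunk_free_cb, populate) is hypothetical and only mimics the shape of the idea; the callback signature is invented for the sketch and is not DPDK's rte_mempool_memchunk_free_cb_t.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical callback type, invented for this sketch (not DPDK's).
using chunk_free_cb = void (*)(void* opaque, void* addr);

// Hypothetical pool that does not allocate object memory itself; it only
// manages objects carved out of chunks supplied by the caller.
class external_pool {
    struct chunk {
        char* addr;
        chunk_free_cb free_cb;
        void* opaque;
    };
    std::size_t _obj_size;
    std::vector<char*> _free_objs;
    std::vector<chunk> _chunks;
public:
    explicit external_pool(std::size_t obj_size) : _obj_size(obj_size) {}
    external_pool(const external_pool&) = delete;
    external_pool& operator=(const external_pool&) = delete;

    // On destruction, give every chunk back to whoever supplied it.
    ~external_pool() {
        for (auto& c : _chunks) {
            if (c.free_cb) {
                c.free_cb(c.opaque, c.addr);
            }
        }
    }

    // Carve [addr, addr + len) into objects; returns how many were added.
    // Any tail smaller than one object is left unused.
    std::size_t populate(char* addr, std::size_t len,
                         chunk_free_cb free_cb, void* opaque) {
        std::size_t n = len / _obj_size;
        for (std::size_t i = 0; i < n; ++i) {
            _free_objs.push_back(addr + i * _obj_size);
        }
        _chunks.push_back({addr, free_cb, opaque});
        return n;
    }

    char* get() {
        if (_free_objs.empty()) {
            return nullptr;
        }
        char* p = _free_objs.back();
        _free_objs.pop_back();
        return p;
    }
    void put(char* p) { _free_objs.push_back(p); }
    std::size_t available() const { return _free_objs.size(); }
};

// Returns how many chunks were handed back to the owner.
std::size_t demo_populate() {
    std::size_t released = 0;
    {
        external_pool pool(2048);
        // Memory owned by the application's allocator (malloc stands in
        // for it here), so the application still accounts for it.
        char* mem = static_cast<char*>(std::malloc(8192 + 100));
        assert(mem != nullptr);
        auto cb = [](void* opaque, void* addr) {
            std::free(addr);
            ++*static_cast<std::size_t*>(opaque);
        };
        std::size_t added = pool.populate(mem, 8192 + 100, cb, &released);
        assert(added == 4);
        char* obj = pool.get();
        assert(pool.available() == 3);
        pool.put(obj);
    }   // pool destroyed: the chunk is returned through the callback
    return released;
}
```

The point of the model is that the memory never leaves the application allocator's books, which is the accounting property described in point 1 above.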
On Wednesday, 8 May 2019 02:26:07 UTC+8, Avi Kivity wrote:
Certainly an update of DPDK would be welcome. It's likely not to be an easy task, because certain APIs we depend on were removed.
One thing I wanted was to start using iommu instead of hugepage pinning + /proc/$$/pagemap, which I think is part of the API change. So we may have to switch to iommu as part of the upgrade.
On 07/05/2019 18.45, Yibo Cai wrote:
Seastar is still using DPDK 17.05, which is more than 2 years old. The DPDK API has changed a lot, and the current Seastar code fails to build with the latest DPDK 19.02. E.g., some xmem functions (rte_mempool_xmem_create, etc.) used in the current code have been dropped since DPDK 18.08.
I'm having trouble building and running Seastar+DPDK on Mellanox ConnectX NICs. To run DPDK on Mellanox NICs, there are heavy dependencies among the OFED lib/driver (provided by Mellanox), distro version, firmware version, DPDK version and kernel version. The best way is to use the latest versions of all components, which are verified by both Mellanox and the DPDK community.
With DPDK 17.05, I have to downgrade OFED, the kernel and even the distro (Ubuntu 18.04 to 16.04), as there is no older OFED version for Ubuntu 18.04. I don't think that's the right way, and it won't work on Arm, as only the latest OFED supports Arm.
So I'm thinking of upgrading Seastar to support DPDK 19.02. I have some Mellanox ConnectX-4 and ConnectX-5 NICs and an Intel 82599 for testing.
I'd like to know whether this upgrade would be welcome and whether there are any concerns. Thanks.
You received this message because you are subscribed to the Google Groups "ScyllaDB development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scylla...@googlegroups.com.
To post to this group, send email to scylla...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/scylladb-dev/6df4cb39-15b6-4245-a475-3bd6d31f589c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
On 5/30/19 2:05 PM, Yibo Cai wrote:
Continuing from my last mail: the second data path (external DPDK buffer,
zero copy) is also working now. Things are looking promising :)
Can we create a local branch in the scylladb dpdk repo to track the latest
DPDK 19.05 release code? It would help with the Seastar DPDK upgrade.
Pushed (local-19.05). But I hope we won't need local patches any more.