Just a broad-stroke question...
I understand that Maglev processes packets without kernel involvement. Does that mean you're using raw sockets, or some other technique?
Is it reasonable to consider Intel DPDK, where the drivers for several Intel MAC controllers (X540, 82599EB, etc.) run in user space and allow packets to move between the NIC and user space without kernel involvement? This can also completely eliminate the need for context switches.
I've run Intel DPDK and do see packet rates (PPS) high enough to saturate a 10GbE interface. I didn't test with 64B packets, though; my average packet size was more in the 400B to 600B range.
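
For context on why the packet size matters here, a rough back-of-the-envelope sketch of the packet rate needed to fill a 10GbE link at different frame sizes. It assumes the sizes are Ethernet frame sizes including the 4-byte FCS (the usual "64B packet" convention) and adds the 20 bytes of per-frame wire overhead (preamble, start-of-frame delimiter, inter-frame gap). The function name `line_rate_pps` is just an illustrative helper, not anything from DPDK.

```python
# Assumption: frame sizes include the 4-byte FCS, as in the usual
# "64B packet" convention. Each frame also occupies extra time on the wire:
# 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap.
WIRE_OVERHEAD_BYTES = 7 + 1 + 12


def line_rate_pps(frame_bytes: int, link_bps: float = 10e9) -> float:
    """Maximum frames per second the link can carry at this frame size."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


if __name__ == "__main__":
    for size in (64, 400, 500, 600):
        print(f"{size:>4} B frames: {line_rate_pps(size) / 1e6:6.2f} Mpps")
```

Under those assumptions, 64B frames need about 14.88 Mpps to fill the link, while 400B to 600B frames need only about 2 to 3 Mpps, so saturating 10GbE at my packet sizes is a much lighter per-packet load than the 64B case.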