This is really exactly what WVR was designed for, so you are definitely in the right place!
I would say yes, it is possible to accomplish all this with WVR.
However...
I think it would be much easier to accomplish if you were to handle the screen and other peripherals with an Uno, and use WVR just for the audio, by sending it MIDI data.
WVR is highly optimized for audio, so the more work it does on other tasks, the less it can devote to processing all 18 voices of polyphony.
There is also a learning curve to programming the ESP32, even with Arduino: FreeRTOS tasks, setting up FTDI and entering bootloader mode, getting stack traces and logs working well, etc.
If you are comfortable with Teensy or Uno, this is a foolproof solution: WVR can run its stock firmware and be very reliable.
Using MIDI in Arduino is not hard to learn; there are fantastic libraries that make it very intuitive.
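To give a rough idea of how little is involved on the Uno side, here is a minimal sketch of the bytes a MIDI note-on and note-off consist of. It is plain C++ so you can try it on a desktop first. The helper names (`noteOn`, `noteOff`, `MidiMessage`) are made up for illustration, and the channel and note numbers you actually need depend on how your WVR is configured, which I am only assuming here.

```cpp
#include <array>
#include <cstdint>

// A MIDI channel voice message is 3 bytes: status, data1, data2.
// (MidiMessage, noteOn and noteOff are hypothetical names for this sketch.)
using MidiMessage = std::array<std::uint8_t, 3>;

// Build a note-on message.
// channel: 1..16 as shown on most gear (sent on the wire as 0..15).
// note, velocity: 0..127 (masked to 7 bits, as MIDI data bytes require).
inline MidiMessage noteOn(std::uint8_t channel, std::uint8_t note,
                          std::uint8_t velocity) {
    MidiMessage msg = {{
        static_cast<std::uint8_t>(0x90 | ((channel - 1) & 0x0F)),
        static_cast<std::uint8_t>(note & 0x7F),
        static_cast<std::uint8_t>(velocity & 0x7F)
    }};
    return msg;
}

// Build a note-off message (status 0x80, release velocity 0).
inline MidiMessage noteOff(std::uint8_t channel, std::uint8_t note) {
    MidiMessage msg = {{
        static_cast<std::uint8_t>(0x80 | ((channel - 1) & 0x0F)),
        static_cast<std::uint8_t>(note & 0x7F),
        static_cast<std::uint8_t>(0)
    }};
    return msg;
}

// On an Uno you would push these bytes out of the hardware UART, roughly:
//   Serial.begin(31250);                  // standard MIDI baud rate
//   MidiMessage m = noteOn(1, 60, 100);   // channel 1, middle C (assumed values)
//   Serial.write(m.data(), m.size());
```

In practice one of the Arduino MIDI libraries will build these bytes for you, so you would normally call something like a send-note-on function rather than assemble them by hand. The point is just that the Uno only has to send three bytes per note event to WVR.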