Hi Mario,
This is absolutely possible; you just need to be a little creative.
What the SSR WFS renderer does is compute loudspeaker feeds. Signal-processing-wise, you need to fill the gap between those loudspeaker feeds and the virtual microphones whose signals you are ultimately interested in. Assuming that the loudspeakers are also omnidirectional and that the room is anechoic, the signal that a virtual microphone captures from a given loudspeaker depends only on the distance between the two: the distance causes a delay and an attenuation.
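To make that concrete (under the usual point-source assumption): a loudspeaker at distance d from the virtual microphone contributes its feed delayed by d / c (with c ≈ 343 m/s) and scaled by roughly 1 / (4 * pi * d). The absolute gain convention is up to you; what matters is the relative 1/d weighting and the relative delays between the loudspeakers.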
There are many ways of implementing this. Here are 2 examples:
1) Record the output of the WFS renderer (= the loudspeaker feeds, check the SSR command line arguments for how to do that), then apply the appropriate delays and attenuations in separate software such as MATLAB or Python, and add up the delayed and attenuated loudspeaker signals (there is a small Python sketch of this after the two options). And voilà!
2) Daisy-chain two SSRs. The first one uses a WFS renderer (to produce the loudspeaker feeds), and the second one uses, for example, the Generic renderer, into which you load impulse responses that apply the required delays and attenuations and produce one output signal: the signal of the virtual microphone.
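For option 1, here is a minimal Python sketch of the delay-and-sum step. Everything specific in it is a placeholder assumption on my part: the file name 'loudspeaker_feeds.wav', the loudspeaker and microphone positions, the 1/(4*pi*d) point-source gain, and the rounding to integer-sample delays (a fractional-delay filter would be more accurate).

import numpy as np
import soundfile as sf  # pip install soundfile

C = 343.0  # speed of sound in m/s

# Placeholder geometry (metres): 16 loudspeakers on a line, one virtual microphone.
# The row order must match the channel order of the recorded loudspeaker feeds.
ls_positions = np.stack([np.linspace(-2.0, 2.0, 16), np.full(16, 2.0)], axis=1)
mic_position = np.array([0.0, 0.0])

# Multichannel recording of the WFS renderer output, one channel per loudspeaker
feeds, fs = sf.read('loudspeaker_feeds.wav', always_2d=True)  # shape: (samples, channels)
assert feeds.shape[1] == len(ls_positions)

distances = np.linalg.norm(ls_positions - mic_position, axis=1)
delays = np.round(distances / C * fs).astype(int)  # integer-sample delays (simplification)
gains = 1.0 / (4.0 * np.pi * distances)            # point-source 1/d attenuation

# Delay, scale, and sum all loudspeaker signals at the virtual microphone
out = np.zeros(feeds.shape[0] + delays.max())
for ch in range(feeds.shape[1]):
    out[delays[ch]:delays[ch] + feeds.shape[0]] += gains[ch] * feeds[:, ch]

# Normalise to avoid clipping and write the virtual microphone signal
sf.write('virtual_mic.wav', out / np.max(np.abs(out)), fs)

For option 2, the impulse responses you would load into the Generic renderer are essentially the same thing expressed as filters: for each loudspeaker channel, a single impulse scaled by the gain above and shifted by the corresponding delay.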
Good luck!
Best regards,
Jens