As I said below, this is no longer part of my 'recall' memory. It's been many years since I looked at the existing research on the subject, and I've lost most of my old links.
A few related things I did find:
I had looked up Landay's work regarding non-speech voice control (apparently, it's 150% faster), and I recall some of the experiments being similar to programming. I've never actually read Blair's paper, just 'saved it for later' and then forgot about it. It looks fascinating.
VRML sucks. X3D sucks only marginally less.
If your interest is representing structure, I suggest abandoning any fixed-form meshes and focusing on procedural generation. Procedurally generated scenegraphs - where 'nodes' can track rough size, occlusion, and rough brightness/color properties (to minimize pop-in) - can be vastly more efficient, reactive, and interactive, and can have much finer 'level of detail' steps. (Voxels are also interactive, but they have a relatively high memory overhead, and they're ugly.) Most importantly, PG content can also be 'adaptive' - i.e., pieces of art that partially cooperate with their context to fit themselves in.
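To give a rough idea of what I mean: here's a minimal sketch (hypothetical names and numbers, not from any real engine) of a scenegraph whose nodes carry a rough bounding radius and average color, and whose children are generated lazily from a seed. A renderer can draw the cheap color proxy until a node's projected size crosses a detail budget, which is what makes the LOD steps arbitrarily fine and keeps pop-in down.

```python
import random

class PGNode:
    """A procedurally generated scenegraph node (illustrative sketch).

    Tracks a rough bounding radius and an average color so a renderer
    can draw a cheap proxy until the node is worth expanding.
    """
    def __init__(self, seed, radius, avg_color):
        self.seed = seed
        self.radius = radius        # rough bounding size
        self.avg_color = avg_color  # rough brightness/color proxy
        self._children = None       # generated lazily, on demand

    def children(self):
        # Children are derived deterministically from the seed, so the
        # full graph never has to exist in memory at once and the same
        # seed always regenerates the same subtree.
        if self._children is None:
            rng = random.Random(self.seed)
            self._children = [
                PGNode(rng.randrange(1 << 30),
                       self.radius * 0.5,
                       tuple(min(1.0, max(0.0, c + rng.uniform(-0.1, 0.1)))
                             for c in self.avg_color))
                for _ in range(rng.randint(2, 4))
            ]
        return self._children

def visible_set(node, distance, budget):
    """Expand nodes only while their projected size (radius / distance)
    meets a detail budget; otherwise return the node itself as a proxy."""
    if node.radius / distance < budget:
        return [node]  # too small on screen: draw the color proxy
    out = []
    for child in node.children():
        out.extend(visible_set(child, distance, budget))
    return out
```

From far away, `visible_set` returns just the root's proxy; as the viewer approaches, the same call expands progressively deeper subtrees, one halving of radius at a time.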
If I ever get back to this subject in earnest, I'll certainly be pursuing a few hypotheses that I haven't found the opportunity to test.
But even if those don't work out, the procedural generation communities have a lot of useful stuff to say on the subject of VR.
I haven't paid attention to VWF. If you haven't done so, you should look into Croquet and OpenCobalt.
Best,
Dave