Regarding the "quality" of a level generator, I think we should first agree on what is meant by "quality". The metrics in our paper, which I think are the ones you're talking about, were explicitly not intended to determine what makes a "better" generator. I would posit that it may be impossible to say what makes a "better" generator without a concrete game context, and even then, it may not be possible or desirable to do so. It is perhaps more important to look at the impact of the generative space on player experience or the desired game aesthetic. It really depends on the goals that you have in making a generator.
To answer your specific questions:
1) Maybe? But we don't really know for sure. This sounds like a great set of research papers in itself! :)
2) This depends entirely on your goals with the generator, what kind of game or tool you want to use it in, and how general purpose you want it to be. I believe that this question is impossible to answer as it is formulated. You could imagine a generator that could make only one beautiful and emotionally resonant level. I would hate to call this hypothetical generator "worse" than one that can create thousands upon thousands of mediocre pieces of content. The number of things a generator can produce doesn't, by itself, tell you which generator is better.
3) Again, I think the term "quality" might be overloaded here....
4) There is a lot of research to be done in identifying new metrics! Defining metrics that are relevant to your design context is a perfectly reasonable thing to do.
In general, my goal with evaluation is to get a nuanced understanding of generative space and expressive range. By "generative space", I mean the properties of content that can be generated and the "shape" of that space. By "expressive range", I mean how the generative space responds to input, and how design-relevant the input controls are.
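
To make those two terms a bit more concrete, here is a minimal sketch of how one might look at them in practice. Everything in it is an assumption made up for illustration: the toy generator `generate_level`, its single input `gap_chance`, and the two metrics `gap_fraction` and `height_variation` are hypothetical and are not the metrics from our paper. The idea is just to sample many levels, measure each one, and bin the measurements into a 2D histogram (a rough picture of the generative space), then repeat with a different input value to see how that picture shifts (a rough picture of expressive range).

```python
import random
from collections import Counter


def generate_level(gap_chance, length=40, rng=None):
    """Toy generator (hypothetical): a level is a list of column heights, 0 = gap."""
    if rng is None:
        rng = random.Random()
    level, height = [], 3
    for _ in range(length):
        if rng.random() < gap_chance:
            level.append(0)  # a gap the player must jump over
        else:
            # Ground height drifts by at most one tile, clamped to 1..6.
            height = max(1, min(6, height + rng.choice((-1, 0, 1))))
            level.append(height)
    return level


def gap_fraction(level):
    """Illustrative metric: share of columns that are gaps, in [0, 1]."""
    return sum(1 for h in level if h == 0) / len(level)


def height_variation(level):
    """Illustrative metric: mean height change between solid columns, in [0, 1]."""
    solid = [h for h in level if h > 0]
    if len(solid) < 2:
        return 0.0
    steps = sum(abs(a - b) for a, b in zip(solid, solid[1:]))
    return steps / (len(solid) - 1)


def sample_space(gap_chance, samples=500, bins=5, seed=0):
    """Bin (gap_fraction, height_variation) for many sampled levels."""
    rng = random.Random(seed)
    hist = Counter()
    for _ in range(samples):
        level = generate_level(gap_chance, rng=rng)
        gx = min(bins - 1, int(gap_fraction(level) * bins))
        vy = min(bins - 1, int(height_variation(level) * bins))
        hist[(gx, vy)] += 1
    return hist


if __name__ == "__main__":
    # Compare the histogram under two settings of the input control.
    for gap_chance in (0.1, 0.5):
        hist = sample_space(gap_chance)
        print(f"gap_chance={gap_chance}: {len(hist)} occupied cells")
        for cell, count in sorted(hist.items()):
            print("  ", cell, count)
```

Which cells are occupied, and how densely, is one view of the "shape" of the space; how the occupied cells move when `gap_chance` changes is one view of how the space responds to input. Whether these particular metrics mean anything depends on your design context, which is the point of 4) above.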