I know that questions like this are not strictly about R programming, but since methodological issues are often discussed in the books, I venture to ask mine in this group.
Since historical corpora are rarely genre-balanced (COHA is an exception, but I am working with corpora that have a greater time depth), is it methodologically sound to use genre as a predictor in a regression analysis when one is not sure how the genres are distributed across the different time periods? What if the corpus size is 1+ billion words (e.g. in EEBO)? Isn't there a way to get rid of the problem of an item being overrepresented once the corpus is very large?