Regression analysis: genre variation in unbalanced corpora

Aug 26, 2020, 2:04:25 PM
to StatForLing with R
Dear all,

I know that this question is not about R programming as such, but since methodological issues are often discussed in the books, I venture to ask it in the group.

Historical corpora are rarely genre-balanced (COHA is an exception, but I am working with corpora of greater time depth). Is it methodologically sound to use genre as a predictor in a regression analysis when one is not sure how the genres are distributed across the different time periods? What if the corpus is very large, say 1+ billion words (e.g. EEBO)? Does a very large corpus offer a way around the problem of an item being overrepresented?
