To measure or quantify anything, you first need to define
precisely the thing you are trying to measure or quantify. In this
case, you mention "quality of training documents". If you're feeding LDA
a "bag of words", then that bag of words is LDA's only view of your documents.
Each
document is a bag of words. Are the words in those "bags" semantically
meaningful, often misspelled, or full of markup notations from wiki
pages or html? Or are stopwords removed, words spelled correctly, with
no markup? Are the documents typically short (e.g., tweets), typically
long (better for LDA), or is document length normally distributed? Those are just examples.
GIGO -- garbage in, garbage out. You can always find ways to
artificially reduce the "quality" of your documents, then run LDA with identical hyperparameters on both the "good" and the "bad" training set. If you're using gensim, compare the perplexity of the two resulting models. However, I'm not personally convinced that any purely human-out-of-the-loop approach is the "answer" for evaluating topic model quality.