Hi Kyle,
It's always best to evaluate the quality directly on your end task,
the one you're trying to solve with these unsupervised models.
Generic proxy metrics like perplexity are also an option, but may
not necessarily correlate well with what you're really trying to
achieve (i.e., a model with lower perplexity could nevertheless
perform worse in practice).
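
To make that concrete, here is a rough sketch in plain Python. The
model names, log-probabilities, predictions and gold labels are all
made up for illustration, and the helper functions are hypothetical
(they are not part of any library). It only assumes you can get
per-token natural-log likelihoods and end-task predictions out of
each candidate model.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean per-token natural-log likelihood."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def accuracy(predicted, gold):
    """Fraction of end-task predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def pick_by_end_task(candidates, gold):
    """Return the name of the candidate with the highest end-task accuracy."""
    return max(
        candidates,
        key=lambda name: accuracy(candidates[name]["predictions"], gold),
    )

# Made-up numbers for two hypothetical models, for illustration only.
gold = [1, 0, 1, 1]
candidates = {
    "model_a": {"log_probs": [-1.0, -1.0, -1.0], "predictions": [1, 1, 0, 1]},
    "model_b": {"log_probs": [-2.0, -2.0, -2.0], "predictions": [1, 0, 1, 1]},
}

for name, c in candidates.items():
    print(name, perplexity(c["log_probs"]), accuracy(c["predictions"], gold))
print("best on end task:", pick_by_end_task(candidates, gold))
```

With these toy numbers, model_a has the lower perplexity but model_b
does better on the end task, so selecting by perplexity alone would
pick the worse model.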
Best,
Radim