LDA isn't my area of expertise, but I suspect spreading the training out over multiple processes changes the optimization enough that the end results can differ. Changing parameters like `passes` or `iterations` might bring them closer again - especially if you're willing to give up some of the multicore speedup to train longer.
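For instance, a minimal sketch along these lines - with a toy corpus standing in for your real data, & the parameter values picked arbitrarily, not tuned - would let you compare a plain `LdaModel` run against an `LdaMulticore` run that's been given extra `passes`/`iterations`:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, LdaMulticore

# toy documents standing in for your real, tokenized corpus
texts = [["human", "interface", "computer"],
         ["survey", "user", "computer", "system"],
         ["graph", "minors", "trees"],
         ["graph", "trees", "system", "survey"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# single-process baseline
lda_single = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
                      passes=1, iterations=50, random_state=0)

# multicore run given extra passes/iterations, trading back some of the
# parallel speedup for more training
lda_multi = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=2,
                         workers=3, passes=10, iterations=400, random_state=0)
```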
Curious:
* What sort of "issues" did you initially have with `LdaMulticore`? Given that you're showing final results with `LdaMulticore`, & seem to prefer them, what's different between the results you're showing and those from the (other?) setup where it was unusable?
* How many processors on the system (& thus effective value of `workers` for `LdaMulticore`)?
* How much of a speedup are you seeing in `LdaMulticore`?
* What makes you prefer the higher final `avg_max_topic` value? (Is there some tangible downstream task on which the higher values give better results than the lower values?)
* That your graph shows such a big shift between 'early' and 'late' documents is a bit suspicious; clumping of similar documents in long runs during training can impair generalizable learning (as incremental batches individually don't have the full range of corpus variety). What changes if you shuffle the documents, to eliminate any similar-document clumps, before training? (A tiny sketch of that is below.)
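For example - again with toy documents standing in for yours, & other parameters arbitrary - shuffling just means reordering the documents once, up front, before building the corpus & training:

```python
import random

from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

# `docs` stands in for your tokenized documents, in their original order
docs = [["human", "interface", "computer"],
        ["survey", "user", "computer", "system"],
        ["graph", "minors", "trees"],
        ["graph", "trees", "system", "survey"]]

# break up any similar-document clumps (seeded so the shuffle is repeatable)
random.Random(0).shuffle(docs)

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

lda = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=2,
                   workers=3, passes=2, random_state=0)
```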
- Gordon