Hi Radim,
Thanks a lot! It didn't occur to me that the default settings of the hyperparameters would be so different. I had tried setting alpha='auto' with little success, but after your post I set alpha=0.5 (I think that's the default in Mallet?), and the topics seem to have improved. I should probably go back and understand what that parameter really means...
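To build some intuition about what alpha actually controls, I put together a quick sketch: alpha is the concentration parameter of the symmetric Dirichlet prior over each document's topic weights, so a small alpha should push each document toward a few dominant topics, while a large alpha spreads the weight across many topics. This is pure stdlib (Dirichlet draws via the usual gamma-normalization trick), and all the names are just illustrative:

```python
import random

def dirichlet_sample(alpha, k, rng):
    # Draw one topic-weight vector from a symmetric Dirichlet(alpha)
    # by normalizing k independent Gamma(alpha, 1) draws.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
k = 100  # same number of topics I'm using

# Largest topic weight per document, averaged over 200 simulated documents.
sparse = [max(dirichlet_sample(0.01, k, rng)) for _ in range(200)]
dense = [max(dirichlet_sample(5.0, k, rng)) for _ in range(200)]
avg_sparse = sum(sparse) / len(sparse)
avg_dense = sum(dense) / len(dense)

# Small alpha concentrates mass on a few topics per document;
# large alpha makes every document a near-uniform mix.
```

Running this, the small-alpha documents put most of their mass on a single topic, while the alpha=5 documents are close to uniform, which matches why the choice of default matters so much.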
As for preprocessing, I'm using the ap.dat file from Blei's site directly, with vocab.txt as the dictionary. I saw that the Mallet wrapper uses a stopword set from the Mallet package, so I downloaded that list and removed those words from the corpus before running LDA. Doing this didn't seem to help much (before setting alpha), and neither did preprocessing with tf-idf. Anyway, I'll keep you updated as I experiment with the other hyperparameter, but definitely let me know what you find, since I might be doing it all wrong :).
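For reference, the stopword removal I'm doing is roughly this kind of filtering over the tokenized documents before building the bag-of-words corpus (the stopword set here is just a tiny stand-in for Mallet's actual English stoplist, and the documents are made up):

```python
# Tiny stand-in for Mallet's English stoplist, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def strip_stopwords(docs):
    # Drop stopwords (case-insensitively) from each tokenized document.
    return [[tok for tok in doc if tok.lower() not in STOPWORDS]
            for doc in docs]

docs = [["The", "court", "of", "appeals", "ruled"],
        ["A", "story", "in", "the", "AP", "wire"]]
cleaned = strip_stopwords(docs)
# cleaned == [['court', 'appeals', 'ruled'], ['story', 'AP', 'wire']]
```

Nothing fancy, just filtering the token lists before they ever reach the dictionary/corpus step.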
As for code, I'm happy to send you an email with the scraps that I have, but I'm not doing anything beyond what's mentioned above and then calling

lda_model = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=100, passes=10)

sometimes with the other parameters I mentioned set as well. I set passes to 10 because I was getting a warning about too few passes.
Thanks again!