For large collections of documents, the number of dimensions used is typically in the 100 to 500 range. In our little example, since we want to graph it, we’ll use 3 dimensions, throw out the first dimension, and graph the second and third dimensions.
The reason we throw out the first dimension is interesting. For documents, the first dimension correlates with the length of the document. For words, it correlates with the number of times that word has been used in all documents. If we had centered our matrix by subtracting the average column value from each column, then we would use the first dimension. As an analogy, consider golf scores. We don’t want to know the actual score; we want to know the score after subtracting par from it. That tells us whether the player made a birdie, bogey, etc.
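To make the idea concrete, here is a minimal sketch using NumPy. The count matrix `A` and all variable names are made up for illustration (they are not the example from the text). It keeps 3 dimensions, prints how strongly the first dimension tracks word totals and document lengths, keeps the second and third dimensions as plotting coordinates, and shows what centering the columns would look like.

```python
import numpy as np

# Hypothetical word-by-document count matrix (rows = words, columns = documents).
# The values are invented for illustration only.
A = np.array([
    [1, 0, 2, 0, 1],
    [0, 1, 1, 0, 3],
    [2, 1, 0, 1, 1],
    [0, 0, 1, 2, 2],
    [1, 1, 1, 0, 4],
    [0, 2, 0, 1, 1],
], dtype=float)

# Full SVD, then keep only the first k = 3 dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Compare the first dimension with overall word usage and document length.
word_totals = A.sum(axis=1)   # how often each word is used across all documents
doc_lengths = A.sum(axis=0)   # total word count of each document
word_corr = abs(np.corrcoef(U_k[:, 0], word_totals)[0, 1])
doc_corr = abs(np.corrcoef(Vt_k[0, :], doc_lengths)[0, 1])
print("first dimension vs. word totals:", word_corr)
print("first dimension vs. document lengths:", doc_corr)

# Coordinates for a 2-D plot: drop dimension 1, keep dimensions 2 and 3.
word_xy = U_k[:, 1:3]      # one (x, y) pair per word
doc_xy = Vt_k[1:3, :].T    # one (x, y) pair per document

# The alternative mentioned above: center each column before the SVD,
# in which case the first dimension would be kept.
A_centered = A - A.mean(axis=0, keepdims=True)
Uc, sc, Vtc = np.linalg.svd(A_centered, full_matrices=False)
```

Because every entry of `A` is non-negative, all entries of the first singular vector share one sign, which is why that dimension behaves like an overall "size" measure rather than a contrast between topics.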
OK, I can retrieve U, S, and V after LSI using the FAQ recipe. I can also get the documentsXtopics matrix after LSI by multiplying V and S (with the proper TF-IDF parameters, this yields exactly the same result as sklearn's TruncatedSVD() followed by fit_transform(), which is reassuring).
But how do I get the documentsXtopics matrix after LDA?
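For reference, the LSI half of this (documentsXtopics = V times S) can be checked with plain NumPy. This is only a sketch under assumptions: the matrix `X` is random stand-in data rather than real TF-IDF output, `np.linalg.svd` stands in for both gensim's LSI and sklearn's TruncatedSVD, and all variable names are hypothetical. Singular vectors are only defined up to sign, so the last comparison uses absolute values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a TF-IDF matrix: rows = documents, columns = terms.
X = rng.random((6, 10))
k = 3  # number of topics to keep

# Terms-by-documents orientation: A = U S V^T,
# so U is terms x topics and V is documents x topics.
A = X.T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

# documentsXtopics obtained by multiplying V and S.
docs_topics = V_k * s_k  # same as V_k @ np.diag(s_k)

# Projecting the documents onto the term-topic basis gives the same matrix.
projected = X @ U_k
print(np.allclose(docs_topics, projected))

# Decomposing the documents-by-terms matrix directly and taking U * S
# gives the same values, up to a sign flip per column.
U2, s2, V2t = np.linalg.svd(X, full_matrices=False)
direct = U2[:, :k] * s2[:k]
print(np.allclose(np.abs(direct), np.abs(docs_topics)))
```

The first check is an exact identity (A^T U = V S), so it holds regardless of sign conventions; only the comparison against a separately computed decomposition needs the absolute values.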