I've posted a tutorial on the RTextTools blog showing how to do Latent Dirichlet Allocation with RTextTools + topicmodels...
Best,
Tim
Looking good. One thing that came up at the useR! conference was whether RTextTools uses only a bag-of-words representation or can also include n-grams or sentiment analysis. That should probably be considered a long-term goal of the project.
Wouldn't bigrams really require some sort of custom handling of the text -> td matrix conversion?
-- Wouter
Bigrams simply require passing a tokenizer function in the control list of tm's DocumentTermMatrix function. RWeka has an NGramTokenizer function, but it requires Java. I'm looking into writing a C++ n-gram tokenizer or, alternatively, making RWeka a suggested package and letting users install it if they want an n-gram tokenizer.
I just really hate R + Java… it's a match made in hell.
Tim
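For readers unfamiliar with the idea, here is a minimal, language-agnostic sketch of what an n-gram tokenizer does, written in Python. It is not the tm or RWeka API: the names `ngrams` and `ngram_tokenizer` are hypothetical, and the lowercase whitespace split is an assumption standing in for a real tokenizer.

```python
def ngrams(tokens, n=2):
    """Return the contiguous n-grams of a token list, each joined by a space."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A document with fewer than n tokens yields an empty list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_tokenizer(text, n=2):
    """Tokenize text into n-grams.

    Assumption: a naive lowercase whitespace split stands in for a real
    tokenizer that handles punctuation and sentence boundaries.
    """
    return ngrams(text.lower().split(), n)


bigrams = ngram_tokenizer("The quick brown fox", n=2)
print(bigrams)
```

Each n-gram then becomes one column of the document-term matrix, exactly as single words do in the bag-of-words case.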
Presumably an n-gram tokenizer shouldn't be horribly difficult to write in C, assuming you can build on the existing tokenizer (figuring out when a period is a token boundary is not fun...).
I'll have a look at the topicmodels package. The lda.collapsed.gibbs.sampler works fine, but it wants a list of term-frequency lists rather than a normal term-document matrix in triplet form, which is quite annoying; explicit support for correlated topic models also sounds interesting. Do you know if topicmodels supports "predicting" the topics of out-of-sample documents?
-- Wouter
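The format mismatch described above can be illustrated with a small Python sketch that turns a sparse triplet matrix into one (term ids, counts) pair per document. The function name `triplets_to_doc_lists` is hypothetical, and the layout is an assumption modelled on the description in the message: 1-based (document, term, count) triplets in, 0-based term ids out.

```python
def triplets_to_doc_lists(i, j, v, n_docs):
    """Convert sparse triplets into per-document term-frequency lists.

    i: 1-based document (row) index of each non-zero cell
    j: 1-based term (column) index of each non-zero cell
    v: count stored in each cell
    Returns one (term_ids, counts) pair per document, with 0-based term ids
    (assumed target layout; check it against the sampler you actually use).
    """
    docs = [([], []) for _ in range(n_docs)]
    for doc, term, count in zip(i, j, v):
        term_ids, counts = docs[doc - 1]
        term_ids.append(term - 1)  # shift from 1-based to 0-based ids
        counts.append(count)
    return docs


# Three documents over a three-term vocabulary.
doc_lists = triplets_to_doc_lists(
    i=[1, 1, 2, 3, 3],
    j=[1, 3, 2, 1, 2],
    v=[2, 1, 4, 1, 1],
    n_docs=3,
)
print(doc_lists)
```

Documents with no non-zero cells come out as a pair of empty lists, so the result always has one entry per document.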
Tim
What I mean is: say I've generated a topic model and then get new articles, and I wish to see how the old topic model classifies the new articles.
A related use case would be to 'update' the topic model with the new articles, e.g. run a new model with the old assignments as starting values, possibly with some extra topics to account for the larger data set. I know this is possible for the method; I just don't know whether it can be done easily using the existing tools...
-- Wouter
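One common way to score new documents against an existing model is "folding in": the topic-word distributions stay fixed and only the new document's topic proportions are estimated. The Python sketch below uses a simple maximum-likelihood EM update (PLSA-style, with no Dirichlet prior), which is a stand-in technique and not what any particular R package does. The name `fold_in` and the toy matrix `phi` are hypothetical.

```python
def fold_in(doc_counts, phi, n_iter=50):
    """Estimate topic proportions for one new document, keeping topics fixed.

    doc_counts: dict mapping term index -> count in the new document
    phi: list of K topic-word distributions, each a list of V probabilities
    Returns a list of K proportions that sum to 1.
    """
    n_topics = len(phi)
    theta = [1.0 / n_topics] * n_topics  # start from uniform proportions
    for _ in range(n_iter):
        new_theta = [0.0] * n_topics
        for term, count in doc_counts.items():
            # E-step: responsibility of each topic for this term
            weights = [theta[k] * phi[k][term] for k in range(n_topics)]
            norm = sum(weights)
            if norm == 0.0:
                continue  # term has zero probability under every topic
            for k in range(n_topics):
                new_theta[k] += count * weights[k] / norm
        total = sum(new_theta)
        if total == 0.0:
            break  # nothing in the document is explained by the model
        # M-step: renormalise expected counts into proportions
        theta = [x / total for x in new_theta]
    return theta


# Toy model (made-up numbers): topic 0 favours terms 0-1, topic 1 favours terms 2-3.
phi = [
    [0.50, 0.40, 0.05, 0.05],
    [0.05, 0.05, 0.40, 0.50],
]
theta_new = fold_in({0: 3, 1: 2, 2: 1}, phi)
print(theta_new)
```

The 'update' use case differs in that the topic-word distributions would also be re-estimated, starting from the old assignments rather than being held fixed.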
Tim