Latent Dirichlet Allocation with RTextTools + topicmodels


Tim Jurka

Aug 30, 2011, 3:12:56 PM
to rtextto...@googlegroups.com
Hi team,

I've posted a tutorial on the RTextTools blog showing how to do Latent Dirichlet Allocation with RTextTools + topicmodels...

http://www.rtexttools.com/1/post/2011/08/getting-started-with-latent-dirichlet-allocation-using-rtexttools-topicmodels.html

Best,
Tim

Loren Collingwood

Aug 30, 2011, 3:18:19 PM
to rtextto...@googlegroups.com

Looking good. One thing that came up at the useR! conference was whether RTextTools uses only bag of words or can also handle n-grams and sentiment analysis. That should probably be considered a long-term goal of the project.

Tim Jurka

Aug 30, 2011, 3:21:08 PM
to rtextto...@googlegroups.com
Yeah I want to include support for bigrams, but at the moment that requires RWeka and I don't want any Java components in this package (it just complicates things).

What do you mean by sentiment analysis?

Tim

Wouter van Atteveldt

Aug 30, 2011, 5:04:10 PM
to rtextto...@googlegroups.com
Looking at the blog post right now. Do you have any idea how the lda package compares with the topicmodels package?

Wouldn't bigrams really require some sort of custom handling of the text -> td matrix conversion?

-- Wouter

Tim Jurka

Aug 30, 2011, 5:09:13 PM
to rtextto...@googlegroups.com
I went with topicmodels because it was better documented. The lda() or slda() functions aren't even documented in the lda package documentation as far as I can tell.

Bigrams simply require passing a tokenizer function in the control list of the tm DocumentTermMatrix function. RWeka has an NGramTokenizer function, but this requires Java. I'm looking into writing a C++ n-gram tokenizer, or alternatively, making RWeka a suggested package and letting users install it if they want an n-gram tokenizer.
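A rough sketch of what a Java-free tokenizer could look like, in base R only. The name bigram_tokenizer and the crude "split on anything non-alphanumeric" rule are assumptions for illustration, not something RTextTools, tm or RWeka ships. The commented-out line at the end shows how such a function would be handed to tm through the control list, assuming a corpus object named corpus already exists.

```r
# Hypothetical helper: turn a text into a character vector of bigrams.
bigram_tokenizer <- function(x) {
  # Lower-case, then split on runs of non-alphanumeric characters (crude rule).
  words <- unlist(strsplit(tolower(as.character(x)), "[^[:alnum:]]+"))
  words <- words[nzchar(words)]          # drop empty strings from leading punctuation
  if (length(words) < 2L) return(character(0))
  # Pair every word with its successor.
  paste(words[-length(words)], words[-1L])
}

bigrams <- bigram_tokenizer("Latent Dirichlet Allocation with RTextTools")
print(bigrams)

# With tm installed, the tokenizer would be passed like this (not run here):
# dtm <- tm::DocumentTermMatrix(corpus, control = list(tokenize = bigram_tokenizer))
```

This does not try to handle sentence boundaries or abbreviations, so bigrams can span a period.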

I just really hate R + Java… it's a match made in hell.

Tim

Wouter van Atteveldt

Aug 30, 2011, 5:22:51 PM
to rtextto...@googlegroups.com
I hate Java in general, so we're square there :-)

Presumably an n-gram tokenizer shouldn't be horribly difficult in C, assuming you can build on the existing tokenizer (figuring out when a period is a token boundary is not fun...).

I'll have a look at the topicmodels package. The lda.collapsed.gibbs.sampler works fine, but it wants a list-of-termfrequency-lists rather than a normal td matrix or triplet, which is quite annoying; and explicit support for correlated topic models sounds interesting. Do you know if topicmodels supports "predicting" the topic of out-of-sample documents?
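To make the format difference concrete, here is a small base R sketch that converts triplet-style data (parallel vectors of document index, term index and count) into the per-document layout the lda package expects: a list with one two-row integer matrix per document, where row 1 holds 0-indexed term ids and row 2 holds counts. The function name triplet_to_lda and the toy values are made up for illustration. topicmodels also provides a dtm2ldaformat() helper for converting a real document-term matrix.

```r
# Hypothetical helper: i = document index, j = term index (both 1-based),
# v = count, n_docs = number of documents.
triplet_to_lda <- function(i, j, v, n_docs) {
  lapply(seq_len(n_docs), function(d) {
    keep <- i == d
    # Row 1: 0-indexed term ids; row 2: how often each term occurs in document d.
    matrix(as.integer(c(j[keep] - 1L, v[keep])), nrow = 2, byrow = TRUE)
  })
}

# Toy data: 3 documents, 3 terms.
i <- c(1L, 1L, 2L, 3L, 3L)
j <- c(1L, 3L, 2L, 1L, 2L)
v <- c(2L, 1L, 4L, 1L, 1L)

docs <- triplet_to_lda(i, j, v, n_docs = 3L)
str(docs)
```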

-- Wouter

Tim Jurka

Aug 30, 2011, 9:34:03 PM
to rtextto...@googlegroups.com
Yes it does. Once you've generated the LDA model, use terms(LDA) to get the most likely terms for each topic, and topics(LDA) to get the most likely topic for each document.
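A minimal sketch of those two calls, assuming the topicmodels package is installed and using its bundled AssociatedPress document-term matrix. The 20-document subset, k = 2 and the seed are arbitrary choices to keep the example fast.

```r
library(topicmodels)

# Example data shipped with topicmodels.
data("AssociatedPress", package = "topicmodels")
train <- AssociatedPress[1:20, ]

# Fit a small two-topic model (default VEM estimation).
lda_model <- LDA(train, k = 2, control = list(seed = 1))

top_terms  <- terms(lda_model, 5)   # the 5 most likely terms per topic
doc_topics <- topics(lda_model)     # the most likely topic for each document

print(top_terms)
print(doc_topics)
```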

Tim

Wouter van Atteveldt

Aug 30, 2011, 9:41:57 PM
to rtextto...@googlegroups.com
Yeah that's the same as lda.

What I mean is: say I've generated a topic model, and then get new articles, and wish to see how the old topic model classifies the new articles.
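For what it's worth, this use case can be sketched with the posterior() method in topicmodels, which accepts a newdata argument and returns topic proportions for documents the model was not fitted on. The example assumes topicmodels is installed; the train/new split of the bundled AssociatedPress data, k = 2 and the seed are arbitrary.

```r
library(topicmodels)

data("AssociatedPress", package = "topicmodels")
train    <- AssociatedPress[1:20, ]    # documents used to fit the model
new_docs <- AssociatedPress[21:30, ]   # "new articles" the model has not seen

lda_model <- LDA(train, k = 2, control = list(seed = 1))

# Infer topic proportions for the unseen documents under the fitted model.
post <- posterior(lda_model, newdata = new_docs)

# One row per new document, one column per topic.
new_topics <- apply(post$topics, 1, which.max)
print(round(post$topics, 3))
print(new_topics)
```

The new documents need to share the vocabulary (columns) of the training matrix, which holds here because both are slices of the same document-term matrix.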

A related use case would be to 'update' the topic model with the new articles, e.g. run a new model from the old assignments as starting values, possibly with some extra topics to account for the larger data set. I know this is possible for the method, I just don't know if it can be done easily using the existing tools...

-- Wouter

Tim Jurka

Aug 30, 2011, 9:44:58 PM
to rtextto...@googlegroups.com
No, it does not support supervised latent Dirichlet allocation.

Tim
