Issue 521 in nltk: latent Dirichlet allocation style topic models


nl...@googlecode.com

Mar 8, 2010, 8:20:39 AM
to nltk-...@googlegroups.com
Status: New
Owner: jordanbg
Labels: Type-Project Priority-Medium

New issue 521 by jordanbg: latent Dirichlet allocation style topic models
http://code.google.com/p/nltk/issues/detail?id=521

I think an LDA implementation would be a good fit for NLTK; I know that
people have asked about it in the past, and I think it's simple enough in
scope that it could get implemented fairly easily and then refined so that
it's very usable and useful.

It would also be useful to have an unsupervised method in NLTK to contrast
with the mostly supervised algorithms.

Here's a paper describing one application of LDA:
http://www.pnas.org/content/101/suppl.1/5228.full

Here's a paper useful for implementation:
http://www.arbylon.net/publications/text-est.pdf
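
For readers following along, here is a minimal sketch of collapsed Gibbs sampling, one common way to fit LDA. It is an illustration only and not part of NLTK: the function name `lda_gibbs`, its parameters, and the default hyperparameter values are assumptions made for this example.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA to `docs` (a list of token lists) by collapsed Gibbs sampling.

    Returns (doc_topic, topic_word): per-document topic counts and
    per-topic word counts after the final sweep.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * num_topics for _ in docs]                # n(d, k)
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # n(k, w)
    topic_total = [0] * num_topics                              # n(k)
    assignments = []                                            # topic of each token

    # Start from a uniformly random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # p(z = t | everything else) is proportional to
                # (n(d,t) + alpha) * (n(t,w) + beta) / (n(t) + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return doc_topic, topic_word
```

Calling `lda_gibbs(docs, num_topics=2)` on a list of tokenized documents returns the raw counts. Normalizing them with the same `alpha` and `beta` smoothing gives estimates of the document-topic and topic-word distributions.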


--
You received this message because you are listed in the owner
or CC fields of this issue, or because you starred this issue.
You may adjust your issue notification preferences at:
http://code.google.com/hosting/settings

nl...@googlecode.com

Jul 24, 2010, 4:32:33 PM
to nltk-...@googlegroups.com
Updates:
Status: Wiki

Comment #1 on issue 521 by StevenBird1: latent Dirichlet allocation style
topic models
http://code.google.com/p/nltk/issues/detail?id=521

Moved to NLTK wiki

nl...@googlecode.com

Mar 27, 2011, 3:28:15 PM
to nltk-...@googlegroups.com

Comment #2 on issue 521 by breno.al...@gmail.com: latent Dirichlet

Hi, I'd like to know if someone already contributed an LDA implementation
to NLTK. If not, I would be willing to do it.

nl...@googlecode.com

Mar 27, 2011, 4:44:45 PM
to nltk-...@googlegroups.com
Updates:
Cc: nmadn...@gmail.com

Comment #3 on issue 521 by jorda...@gmail.com: latent Dirichlet allocation


I don't think so - I had made the offer that if document iterators were
added to nltk corpora, I'd contribute an lda package. Those iterators
haven't happened, so I haven't held up my end. :)

That being said, you're welcome to clean / vet my (alpha) pylda code and
incorporate it into nltk:

http://topicmod.googlecode.com/svn/trunk/projects/pylda/src/pylda.py

I'm happy to help with any advice / issues. I'm cc'ing Nitin, who has also
played with the code.

Best,

Jordan

nl...@googlecode.com

Mar 27, 2011, 5:50:32 PM
to nltk-...@googlegroups.com
Updates:
Cc: StevenBird1

Comment #4 on issue 521 by StevenBird1: latent Dirichlet allocation style
topic models
http://code.google.com/p/nltk/issues/detail?id=521

What new interface do you need to the corpora? An iterator where each item
produced is a list of the words of a document? Why does it need to be an
iterator? (Our corpus readers don't load the whole thing into memory anyway.)


nl...@googlecode.com

Mar 27, 2011, 6:07:36 PM
to nltk-...@googlegroups.com

Comment #5 on issue 521 by jorda...@gmail.com: latent Dirichlet allocation

"Iterator" was too technical a word choice, sorry. It doesn't need to be
an iterator per se; something akin to "sentences", which lets you go
through each sentence as an atomic unit, would do.

Many of the corpora have documents as logical units, but there is no way
to access them at that level. E.g. Brown, Treebank, etc. Some do, e.g.
Europarl has "chapters," but this is pretty ad hoc, and there is no
standard way to access documents in a corpus like there is for sentences or
words.

If I'm mistaken, please let me know!

nl...@googlecode.com

Mar 27, 2011, 6:11:37 PM
to nltk-...@googlegroups.com

Comment #6 on issue 521 by StevenBird1: latent Dirichlet allocation style
topic models
http://code.google.com/p/nltk/issues/detail?id=521

Many corpus readers support access to the underlying files, and these files
sometimes correspond to documents. We could discuss a better interface,
but this might meet your needs. E.g.:

>>> from nltk.corpus import brown
>>> for fid in brown.fileids():
...     print(fid, len(brown.words(fileids=fid)))
ca01 2242
ca02 2277
ca03 2275
.... ....

nl...@googlecode.com

Mar 27, 2011, 6:34:43 PM
to nltk-...@googlegroups.com

Comment #7 on issue 521 by jorda...@gmail.com: latent Dirichlet allocation

While it's true that the files in corpus objects *sometimes* correspond to
documents in the corpus, that's not always a safe assumption (e.g. the nps
chat transcripts or Europarl). It also blurs data representation with the
semantics of the data. Are the files of senseval documents? No, those are
sentences drawn from many different documents.

I'd argue for a first-class "documents" function to make it explicit for
the corpora that do have underlying documents, but "files" would work for a
limited number of corpora now.
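
As a rough illustration of the "files" fallback described above, here is a sketch of a helper that treats each file as one document. The name `documents` and the `ToyCorpus` stand-in are hypothetical and not NLTK APIs. The helper relies only on the `fileids()` and `words(fileids=...)` calls shown earlier in this thread, and the one-file-equals-one-document assumption is exactly the one that fails for corpora such as senseval.

```python
def documents(corpus, fileids=None):
    """Yield (fileid, words) pairs, treating each file as one document.

    Hypothetical helper, not an NLTK API. It assumes one file == one
    document, which does not hold for every corpus.
    """
    if fileids is None:
        fileids = corpus.fileids()
    for fid in fileids:
        yield fid, list(corpus.words(fileids=fid))


class ToyCorpus:
    """In-memory stand-in exposing the two reader methods used above."""

    def __init__(self, files):
        self._files = files  # maps fileid -> list of word strings

    def fileids(self):
        return sorted(self._files)

    def words(self, fileids=None):
        # Accept a single fileid or a list of fileids.
        if fileids is None:
            fileids = self.fileids()
        elif isinstance(fileids, str):
            fileids = [fileids]
        return [w for fid in fileids for w in self._files[fid]]
```

With NLTK and the Brown data installed, `documents(brown)` should work the same way as `documents(ToyCorpus(...))`, since both objects expose the two methods the helper calls.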

nl...@googlecode.com

Mar 27, 2011, 6:45:45 PM
to nltk-...@googlegroups.com

Comment #8 on issue 521 by StevenBird1: latent Dirichlet allocation style
topic models
http://code.google.com/p/nltk/issues/detail?id=521

I agree. It would be helpful if you would open a new "feature request"
issue about this please.

Then we need to agree method names and signatures (documents(),
tagged_documents()?, parsed_documents()?...), and which corpora should have
these methods.
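
To make the naming question concrete, one possible shape for such an interface is sketched here. Everything in it is an assumption for discussion: the class names, the `document_ids` parameter, and the return types are invented, and none of it exists in NLTK.

```python
from abc import ABC, abstractmethod


class DocumentCorpusReader(ABC):
    """Hypothetical interface for corpora with real underlying documents."""

    @abstractmethod
    def document_ids(self):
        """Return the identifiers of all documents in the corpus."""

    @abstractmethod
    def documents(self, document_ids=None):
        """Return a list of documents, each a list of word strings."""

    def tagged_documents(self, document_ids=None):
        # Optional: only meaningful for corpora that carry POS tags.
        raise NotImplementedError("this corpus has no tagged documents")

    def parsed_documents(self, document_ids=None):
        # Optional: only meaningful for corpora that carry parse trees.
        raise NotImplementedError("this corpus has no parsed documents")


class ListBackedReader(DocumentCorpusReader):
    """Toy implementation over an in-memory mapping, for illustration."""

    def __init__(self, docs):
        self._docs = dict(docs)  # maps document id -> list of words

    def document_ids(self):
        return sorted(self._docs)

    def documents(self, document_ids=None):
        ids = self.document_ids() if document_ids is None else document_ids
        return [list(self._docs[i]) for i in ids]
```

Making the tagged and parsed variants optional lets a reader declare only the views its data supports, which speaks to the question of which corpora should have which methods.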


nl...@googlecode.com

Mar 27, 2011, 6:53:46 PM
to nltk-...@googlegroups.com

Comment #9 on issue 521 by jorda...@gmail.com: latent Dirichlet allocation


There's issue 375 - my (very old) attempt to create an issue specific to
documents.

nl...@googlecode.com

Mar 27, 2011, 10:58:23 PM
to nltk-...@googlegroups.com

Comment #10 on issue 521 by nmadn...@gmail.com: latent Dirichlet allocation

I agree: a document-level set of accessor functions for corpora would be
useful for many things. Jordan and I had talked about this, and
the "chapters" method I implemented for the Europarl corpus reader is kind
of an approximation to this, but certainly far from ideal.

I would be happy to help with the discussion and implementation.

nl...@googlecode.com

Mar 28, 2011, 12:07:02 AM
to nltk-...@googlegroups.com

Comment #11 on issue 521 by StevenBird1: latent Dirichlet allocation style
topic models
http://code.google.com/p/nltk/issues/detail?id=521

Please see issue 375 for ongoing discussion of document readers.


nl...@googlecode.com

Sep 21, 2013, 3:44:21 PM
to nltk-...@googlegroups.com

Comment #12 on issue 521 by Artem.Ya...@gmail.com: latent Dirichlet
Was LDA ever implemented?


nl...@googlecode.com

Sep 23, 2013, 11:27:43 PM
to nltk-...@googlegroups.com
Updates:
Owner: StevenBird1

Comment #13 on issue 521 by StevenBird1: latent Dirichlet allocation style
topic models
http://code.google.com/p/nltk/issues/detail?id=521

No -- feel free to propose it, or contribute code, at
https://groups.google.com/forum/#!forum/nltk-dev or
https://github.com/nltk/nltk