Latent Dirichlet Allocation


ivan

Nov 21, 2010, 12:15:31 PM
to MetaOptimize Challenge [discuss]
Hello All,



I have been trying to fit an LDA model to the data set posted
in the blog post. (Was there an updated data set or is the one
posted originally the "final" data set?)

I don't think I will be done by today, but I still wanted to share
with you the approach I am following.

1. Define each line in the data set to be one document (total of D).
2. Define the vocabulary to be the 1M unique words (total of V).

   The inputs are then the word counts for each line of the input.
3. Fit an LDA model with T topics (what are the best \alpha, \beta
   hyperparameters?).
   The outputs are:
     p(t|d) -- a DxT matrix (proportion of topic t in document d) (not used)
     p(w|t) -- a TxV matrix (probability of finding word w in topic t)
     z      -- the specific topic assigned to each word in the corpus (not used)

4. Then, to find semantically related terms for a query word w, do as
follows (a code sketch follows this list):
4.1 Find the most likely topic for w:
        t* = argmax_t p(w|t)
4.2 Print the top ten words from topic t*:
        sem_rel = [ the top 10 words w' ordered by p(w'|t*) ]
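In code, the lookup step would be something like this minimal numpy
sketch (`phi` and `vocab` are placeholder names for the fitted p(w|t)
matrix and the word list, not the API of any particular LDA package):

    import numpy as np

    # phi:   fitted TxV topic-word matrix, phi[t, w] = p(w|t)
    # vocab: list mapping column indices to word strings

    def sem_rel(query, phi, vocab, topn=10):
        """Top `topn` words from the query word's most likely topic."""
        w = vocab.index(query)                  # column index of query word
        t_star = np.argmax(phi[:, w])           # 4.1: t* = argmax_t p(w|t)
        order = np.argsort(phi[t_star])[::-1]   # 4.2: sort by p(w'|t*)
        return [vocab[i] for i in order[:topn]]

(The query word itself will usually appear near the top of the list,
so in practice you would filter it out.)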


1M is a serious vocabulary size, which is why I wanted to play with
this, but I won't be finished until Wednesday or so.

Let me know if you take late submissions....


Peace,
Ivan


PS: I didn't introduce myself. My name you already know;
as for a description, I am a PhD student at McGill. I used to study
information theory, but now I find ML to be much cooler and
more useful in real life. If anyone in the Montreal area wants to
discuss topic models, be sure to contact me.

Joseph Turian

Nov 23, 2010, 12:19:55 AM
to MetaOptimize Challenge [discuss], ivan
Ivan,


> 4. Then, to find semantically related terms for a query word w, do as
> follows:
> 4.1 Find the most likely topic for w:
>         t* = argmax_t p(w|t)
> 4.2 Print the top ten words from topic t*:
>         sem_rel = [ the top 10 words w' ordered by p(w'|t*) ]

This seems like a reasonable approach, although I wonder if it would
be possible to integrate over all topics.

> 1M is a serious vocabulary size, which is why I wanted to play with
> this, but I won't be finished until Wednesday or so.
> Let me know if you take late submissions....

Sure, submit it then and I will include the results.

> PS: I didn't introduce myself. My name you already know;
> as for a description, I am a PhD student at McGill. I used to study
> information theory, but now I find ML to be much cooler and
> more useful in real life. If anyone in the Montreal area wants to
> discuss topic models, be sure to contact me.

Shoot me an email. I live in Mile End.

Best,
Joseph

ivan

Nov 25, 2010, 7:08:21 PM
to MetaOptimize Challenge [discuss]
Hello All,


I have a quick update on my (unsuccessful) results
from using topic models for the challenge.
The topic model worked fine, but I think that to get interesting
semantically related words I would need a lot more topics.
https://github.com/ivanistheone/Latent-Dirichlet-Allocation/blob/master/semrelwords/RESULTS.txt


> > 4. Then, to find semantically related terms for a query word w, do as
> > follows:
> > 4.1 Find the most likely topic for w:
> >         t* = argmax_t p(w|t)
> > 4.2 Print the top ten words from topic t*:
> >         sem_rel = [ the top 10 words w' ordered by p(w'|t*) ]
>
> This seems like a reasonable approach, although I wonder if it would
> be possible to integrate over all topics.
>

Yes, certainly. We could order by
    sum_t p(w|t) p(t|w=query),
which is a form I have seen in several papers.
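For reference, here is a minimal numpy sketch of that scoring (again
with placeholder names: `phi` is the fitted p(w|t) matrix, `p_t` the
topic proportions; p(t|w=query) is obtained by Bayes' rule from
p(w|t) and p(t)):

    import numpy as np

    def sem_rel_integrated(query, phi, p_t, vocab, topn=10):
        """Score every word v by sum_t p(v|t) p(t|w=query)."""
        w = vocab.index(query)
        # Bayes' rule: p(t|w=query) is proportional to p(query|t) p(t)
        p_t_given_w = phi[:, w] * p_t
        p_t_given_w = p_t_given_w / p_t_given_w.sum()
        scores = np.dot(phi.T, p_t_given_w)   # sum over topics per word
        order = np.argsort(scores)[::-1]
        return [vocab[i] for i in order[:topn]]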

I was hoping that by picking the "top" topic I would
be picking only one meaning of the word -- and thus
avoiding problems with polysemy.


> > 1M is a serious vocabulary size, which is why I wanted to play with
> > this, but I won't be finished until Wednesday or so.
> > Let me know if you take late submissions....
>
> Sure, submit it then and I will include the results.
>

I don't think I will be able to continue this exploration this
week, but I will keep coming back to this data set as I flesh
out the functionality of `liblda`.
The data set is so noisy/lacking in information that it makes
a good test of what ML techniques can extract from very
noisy and short documents...


Thanks for posting the challenge!
It has been entertaining...


Ivan
