Hello All,
I have a quick update on (unsuccessful) results
from using topic models for the challenge.
The topic model trained fine, but I think I would need
many more topics to get interesting semantically related words.
https://github.com/ivanistheone/Latent-Dirichlet-Allocation/blob/master/semrelwords/RESULTS.txt
> > 4. Then to find semantically related terms for word "w" do as follows:
> > 4.1 Find most likely topic
> > t* = argmax_{t} p(W=w|t)
> > 4.2 Print the top ten words from topic t*
> > sem_rel = [ select 10 from w ordered by p(w|t*) ]
>
> This seems like a reasonable approach, although I wonder if it would
> be possible to integrate over all topics.
>
Yes, certainly. We could order by
sum_t p(w|t) p(t|w=query),
which is a form I have seen in several papers.
I was hoping that by picking the "top" topic I would
be selecting only one sense of the word -- and
thus avoid problems with polysemy.
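For the record, here is a minimal sketch of both orderings -- the single-topic argmax from step 4 and the sum over all topics discussed above. It assumes a topic-word matrix `phi` with `phi[t, w] = p(w|t)` from some fitted LDA model (the toy matrix, vocabulary, and function names below are all hypothetical, and a uniform topic prior is assumed when forming p(t|w)):

```python
import numpy as np

# Hypothetical topic-word matrix phi[t, w] = p(w | topic t),
# standing in for the output of a fitted LDA model; rows sum to 1.
rng = np.random.default_rng(0)
phi = rng.dirichlet(np.ones(8), size=4)   # 4 topics, 8-word toy vocabulary
vocab = ["cat", "dog", "run", "jump", "fish", "bird", "car", "road"]

def top_topic_neighbors(w, phi, vocab, k=3):
    """Step 4 above: pick t* = argmax_t p(w|t), return top-k words of t*."""
    wi = vocab.index(w)
    t_star = int(np.argmax(phi[:, wi]))
    order = np.argsort(phi[t_star])[::-1]
    return [vocab[i] for i in order[:k]]

def integrated_neighbors(w, phi, vocab, k=3):
    """Alternative: order words w' by sum_t p(w'|t) p(t|w=query)."""
    wi = vocab.index(w)
    # p(t|w) ∝ p(w|t) p(t); assuming a uniform prior p(t), normalize:
    p_t_given_w = phi[:, wi] / phi[:, wi].sum()
    score = p_t_given_w @ phi   # score[w'] = sum_t p(w'|t) p(t|w)
    order = np.argsort(score)[::-1]
    return [vocab[i] for i in order[:k]]
```

The argmax version commits to one sense of the query word, while the integrated version blends all topics the word participates in, so polysemous words pull in neighbors from several senses.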
> > 1M is a serious vocabulary size, this is why I wanted to play with
> > this, but I won't be finished until Wednesday or so.
> > Let me know if you take late submissions....
>
> Sure, submit it then and I will include the results.
>
I don't think I will be able to continue this exploration this
week, but I will keep going back to this data set as I flesh
out the functionality of `liblda`.
The data set is so noisy and information-poor that it makes
a good test of what ML techniques can extract from very
noisy and short documents...
Thanks for posting the challenge!
It has been entertaining...
Ivan