Hi Darren,
> Hi again,
> I indexed all the short abstracts on Wikipedia and was able to build
> the semantic vectors, etc.
Interesting. Could you post a link to where you get these abstracts
from? I know a lot of people are using Wikipedia as a corpus nowadays,
so it's a very good thing to be involved in.
> - For 1.1 million documents, Lucene took 3 hours to index.
> - SemanticVectors took 4 minutes to generate vectors over that index.
> IMPRESSIVE!!
That's great!
> Running some tests I get some varied and at times unexpected
> results. Consider below. [1] produces some expected terms such as
> pitcher, major league, but also 'elephants'. Hehe. For [2] I don't
> really see any terms I would expect to see, especially 'snoopy'.
> Given that the data set is ALL of Wikipedia abstracts, I would
> expect more semantically relevant vectors because of the breadth of
> documents.
Yes, I'm surprised at the politics result.
Things you could try include:
i. using more dimensions (-d option).
ii. using more training cycles (-tc option).
iii. building a positional index (see
http://code.google.com/p/semanticvectors/wiki/PositionalIndexes)
Of these, I would expect iii. to make the most difference, though
you'll probably need to build a whole new Lucene index (example
invocations below).
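The commands would be something like this - I'm quoting the class
names from memory, so check them against the wiki before copying:

  java pitt.search.semanticvectors.BuildIndex -d 500 -tc 5 PATH_TO_LUCENE_INDEX
  java pitt.search.semanticvectors.BuildPositionalIndex PATH_TO_POSITIONAL_LUCENE_INDEX

The second one is for iii., and needs a Lucene index built with term
positions stored - that's why I say you'll probably have to re-index.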
> My question is: Is there a data set and examples that demonstrate
> clustering, LSA, and those sorts of things more readily?
I tend to use the Bible, because it's small and easy - and contrary to
what is often said in the literature, it seems that you can build a
perfectly good model using fewer than 2 million words in your corpus.
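On the clustering front specifically, there is a ClusterResults class
that runs k-means over the top search results, so once you've built
term vectors over a small corpus like this you can try something
along these lines (I'm writing the invocation from memory, so treat
it as a sketch and check the wiki for the exact arguments):

  java pitt.search.semanticvectors.Search peter
  java pitt.search.semanticvectors.ClusterResults peter

As for LSA, that's really the SVD idea I come back to below.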
I have scraped data from the web in the past, with some good results
and some not so good. (It's pretty easy to implement if you're willing
to pick a list of sites and then run wget over this batch - but if you
start to do this for more than a few small sites, check that your
institution / infrastructure is OK with this, use settings that
respect robots.txt, etc. - I know this can all be done, but I wouldn't
advise anyone to do it without properly considering the implications.)
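For example, with your hand-picked list of URLs in a file (sites.txt
is just a made-up name here):

  wget -i sites.txt -r -l 1 -w 2

That fetches each site one level deep with a two-second pause between
requests, and recursive wget honours robots.txt by default.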
Also I've used the European Parliament data and the Ohsumed corpus for
medical stuff - they're easy to find on the web as well.
But really I'd be surprised if any of these is better than Wikipedia
as a source for this sort of work - hopefully there are ways of
getting better results out of the Wikipedia data you already have.
Another thing to consider is method rather than data - the idea of
adding support for SVD as well as random projection has been on my
vague TODO list for SemanticVectors for a while. It really shouldn't
be too hard - though I'd be amazed if there weren't serious
performance problems.
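To make the SVD idea concrete, here is roughly the computation I mean.
This is not SemanticVectors code, just a toy sketch using the
SingularValueDecomposition class from Apache Commons Math, with a
made-up 4-term-by-3-document count matrix:

  import org.apache.commons.math3.linear.MatrixUtils;
  import org.apache.commons.math3.linear.RealMatrix;
  import org.apache.commons.math3.linear.SingularValueDecomposition;

  public class LsaSketch {
      public static void main(String[] args) {
          // Toy term-document count matrix: rows are terms, columns are docs.
          RealMatrix a = MatrixUtils.createRealMatrix(new double[][] {
              {2, 0, 1},
              {0, 1, 0},
              {1, 1, 0},
              {0, 0, 3}
          });
          // Full SVD: a = U * S * V^T.
          SingularValueDecomposition svd = new SingularValueDecomposition(a);
          // Keep the top k left singular vectors as reduced term vectors.
          int k = 2;
          RealMatrix termVectors =
              svd.getU().getSubMatrix(0, a.getRowDimension() - 1, 0, k - 1);
          System.out.println(termVectors);
      }
  }

The performance worry is exactly that SVD step: doing it honestly on
the term-document matrix of 1.1 million Wikipedia abstracts is a very
different thing from this toy, which is why random projection is the
default approach in SemanticVectors today.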
Thanks for keeping in touch - please keep asking and posting results;
this is great for our small community.
Best wishes,
Dominic