Dominic
Jul 15, 2008, 5:02:49 PM
to Semantic Vectors
Dear All,
This post relates to a few semantic and engineering issues, so I'd
very much appreciate any comments and testing people are willing to
do.
For a while, it's been noted that the process of creating the semantic
vectors could be run in several training cycles. Currently,
SemanticVectors does the following:
i. Generates random basic document vectors.
ii. Trains term vectors from these (these are written by default to
termvectors.bin).
iii. Creates learned document vectors from these term vectors (these
are by default written to docvectors.bin).
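For readers who'd like a concrete picture, the cycle might be sketched roughly as below. Everything here is illustrative: class and method names are my own, and the real SemanticVectors code works over a Lucene index with its own vector store classes.

```java
import java.util.Random;

// Rough sketch of the three-step training cycle. Dense float vectors are
// used throughout for simplicity; the real step i starts from sparse
// random vectors.
public class TrainingCycleSketch {

    // Step i: random basic document vectors.
    static float[][] randomDocVectors(int numDocs, int dim, Random rand) {
        float[][] vecs = new float[numDocs][dim];
        for (float[] v : vecs)
            for (int i = 0; i < dim; i++)
                v[i] = rand.nextFloat() - 0.5f;
        return vecs;
    }

    // Step ii: a term vector as the sum of the vectors of the documents
    // that contain the term.
    static float[] trainTermVector(int[] docsWithTerm, float[][] docVectors) {
        float[] tv = new float[docVectors[0].length];
        for (int d : docsWithTerm)
            for (int i = 0; i < tv.length; i++)
                tv[i] += docVectors[d][i];
        return tv;
    }

    // Step iii: a learned document vector as the sum of its term vectors.
    // Feeding these back into step ii gives another training cycle.
    static float[] trainDocVector(float[][] termVectorsInDoc) {
        float[] dv = new float[termVectorsInDoc[0].length];
        for (float[] tv : termVectorsInDoc)
            for (int i = 0; i < dv.length; i++)
                dv[i] += tv[i];
        return dv;
    }
}
```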
Clearly we can take the output in step iii and reuse it as the input
in step i and repeat the process. I've just committed code that does
this.
There is a complication - the document vectors for retraining are no
longer sparse, so the memory-efficient sparse format Trevor
implemented doesn't easily work for retraining. Two options present
themselves:
i. Replace the explicit declaration that basicDocVectors should be
short[][] with a VectorStore interface that may be a Hashtable of
sparse short[] vectors or may be a table of longer float[] vectors.
This is the code I've just committed.
ii. (alternative) "quantize" the learned document vectors to take just
the most significant dimensions and translate these into the short[]
format. I've committed the VectorUtils necessary for this - it gave
pretty strange results so I haven't made it part of the main pipeline.
However, it may still stimulate experiments.
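For concreteness, the quantization idea in option ii might look something like the sketch below: keep the k largest-magnitude dimensions and encode each as a signed short (position + 1, sign taken from the component). This encoding is my assumption for illustration, not necessarily what the committed VectorUtils code does.

```java
import java.util.Arrays;

// Illustrative quantization: reduce a dense float vector to a short[] of
// its k most significant dimensions, using (index + 1) with the sign of
// the component. The real VectorUtils encoding may differ.
public class QuantizeSketch {
    static short[] quantize(float[] v, int k) {
        // Sort dimension indices by descending absolute magnitude.
        Integer[] idx = new Integer[v.length];
        for (int i = 0; i < v.length; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) -> Float.compare(Math.abs(v[b]), Math.abs(v[a])));
        // Keep the top k, encoded as signed one-based positions.
        short[] out = new short[k];
        for (int j = 0; j < k; j++) {
            int i = idx[j];
            out[j] = (short) (v[i] >= 0 ? i + 1 : -(i + 1));
        }
        return out;
    }
}
```

The information thrown away by keeping only k dimensions is presumably why the results looked strange.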
For the short term, if you want to try retraining, check out the main
svn code, compile, and then use the "-tc" argument if you want to use
more than one training cycle. If the code passes review, this will go
into the next release.
For review: does anyone think that these changes may be actually
harmful? This question is particularly for Trevor - are we likely to
regret using a Hashtable with short[] values instead of the explicit
short[][] representation? The new Hashtable version is more flexible:
it allows any Object to be used as a key instead of just the integer
IDs from the Lucene index, but the old version is more space efficient
(no storage for the keyspace). If the space is a problem, perhaps we
can refactor the new VectorStoreSparseRAM to only permit integer IDs?
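To make the two representations being compared concrete, here is a rough sketch of the interface idea: a dense table and a sparse Hashtable-backed store answering the same lookup call. Method names and the sparse encoding (signed one-based positions holding +1/-1) are my assumptions for illustration, not necessarily the committed API.

```java
import java.util.Hashtable;

// Sketch only: both stores answer getVector(), so training code need not
// care whether the backing store is sparse or dense.
interface VectorStoreSketch {
    float[] getVector(Object key);
}

// Old-style dense store: a flat table indexed by integer Lucene doc ID.
// No keyspace storage, but keys must be integers.
class DenseStoreSketch implements VectorStoreSketch {
    private final float[][] vectors;
    DenseStoreSketch(float[][] vectors) { this.vectors = vectors; }
    public float[] getVector(Object key) { return vectors[(Integer) key]; }
}

// Hashtable-backed sparse store: any Object can be a key, at the cost of
// storing the keyspace. Each short encodes a one-based dimension index,
// with its sign giving the stored value (+1 or -1).
class SparseStoreSketch implements VectorStoreSketch {
    private final Hashtable<Object, short[]> sparse = new Hashtable<Object, short[]>();
    private final int dim;
    SparseStoreSketch(int dim) { this.dim = dim; }
    void putVector(Object key, short[] seed) { sparse.put(key, seed); }
    public float[] getVector(Object key) {
        short[] seed = sparse.get(key);
        if (seed == null) return null;
        float[] dense = new float[dim];
        for (short s : seed) {
            if (s > 0) dense[s - 1] = 1f;
            else dense[-s - 1] = -1f;
        }
        return dense;
    }
}
```

Restricting the sparse store to integer keys would essentially mean swapping the Hashtable for an int-indexed structure behind the same interface.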
Best wishes,
Dominic