Second term vectors file when using BuildIndex with -trainingcycles <NUMBER> -docindexing incremental


Michael Ruepp

Jul 18, 2015, 8:09:52 AM
to semanti...@googlegroups.com
When using this command line (or the same call programmatically):

java -cp ~/Downloads/semanticvectors-5.8.jar pitt.search.semanticvectors.BuildIndex -trainingcycles <NUMBER> -luceneindexpath ../luceneIndex/Text -termtermvectorsfile ./termsvectorsfile.bin -docvectorsfile ./docvectorsfile.bin -docindexing incremental

I get two term vectors output files:

The first is "TERMVECTORSFILENAME.BIN", the second is "TERMVECTORSFILENAME<NUMBER>.BIN".

So if I use -trainingcycles 2, I get termsvectorsfile2.bin; if I use 3, I get termsvectorsfile3.bin, in each case in addition to the default termsvectorsfile.bin.

This behaviour is not observed when using LSA or Positional Indexing with the same parameters.

Is this a bug or intended? Which termsvectorsfile should I use for search?


Thanks,

Michael



Dominic Widdows

Jul 18, 2015, 8:59:25 AM
to semanti...@googlegroups.com
You should use the termvectors2.bin file and similar, the ones with the number in them.
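
To illustrate the naming pattern described in this thread, here is a minimal Java sketch that picks the numbered file when it exists and falls back to the unnumbered one otherwise. It is not part of SemanticVectors: the class `TermVectorsFileChooser` and its methods are hypothetical helpers, and the "<base><cycles>.bin" pattern is assumed from the file names reported above rather than taken from the library's code.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not a SemanticVectors class.
public class TermVectorsFileChooser {

    // Inserts the training-cycle count before the extension,
    // e.g. "termvectors.bin" with 2 cycles becomes "termvectors2.bin".
    // Assumption: this mirrors the naming observed in the thread.
    public static String numberedName(String fileName, int trainingCycles) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return fileName + trainingCycles;
        }
        return fileName.substring(0, dot) + trainingCycles + fileName.substring(dot);
    }

    // Prefers the numbered file if it is present in dir,
    // otherwise returns the path of the unnumbered file.
    public static Path choose(Path dir, String fileName, int trainingCycles) {
        Path numbered = dir.resolve(numberedName(fileName, trainingCycles));
        return Files.exists(numbered) ? numbered : dir.resolve(fileName);
    }

    public static void main(String[] args) {
        // Example: look in the current directory after a run with -trainingcycles 2.
        System.out.println(choose(Paths.get("."), "termvectors.bin", 2));
    }
}
```

The returned path is the one to pass to the search step as the term vectors file.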

You're right that BuildPositionalIndex doesn't append this, and I just noticed that the vectors aren't being written appropriately in BuildPositionalIndex. LSA doesn't support repeated training cycles at all.

In general, training cycles have seen little use. This is research code, sometimes built for specific experiments, and making these options work consistently throughout the codebase has not been a focus. For example, it's very easy to set and use new command-line flags, but much harder to make sure that each entry point knows exactly which flag parameters are relevant and which are irrelevant in its particular context.

Best wishes,
Dominic


Michael Ruepp

Jul 18, 2015, 9:07:35 AM
to semanti...@googlegroups.com
So training cycles do not necessarily improve search quality? Could I do without them?

Also, is there any recommendation regarding garbage collection? LSA is a heavy operation, especially with lots of documents/terms. A standard index with 4 million docs/terms ran through with 32 GB of RAM, but I am not sure about LSA.



Thanks,

Michael



