terms weights

Al

Jul 2, 2015, 9:45:21 AM
to semanti...@googlegroups.com
Would you kindly advise how to get term weights with SV, please?

Dominic Widdows

Jul 2, 2015, 6:01:40 PM
to semanti...@googlegroups.com
Where do you want to get term weights, please?

Programmatically, you get them using the LuceneUtils class. To configure which term weights are used in document indexing, you use the -termweight flag (see https://code.google.com/p/semanticvectors/wiki/DocumentSearch).

So it depends on where you want to get them / how you want to use them.

Best wishes,
Dominic

Al

Jul 4, 2015, 12:12:48 AM
to semanti...@googlegroups.com
Hi there

Regarding your question about why I need term weights: "In order to quantitatively judge the similarity between a pair of documents, a method is needed to determine the significance of each term in differentiating one document against other documents." Different weighting schemes have been proposed to help define the significance of terms, TF/IDF among others. More specifically, I want to compare my results, which use TF/IDF at the pre-processing stage for K-Means clustering, against a different tool that provides IDF-based weights, such as SV.
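For reference, here is a minimal sketch of one common IDF form, idf(t) = log(N / df(t)), in plain Python. The function name `idf_weights` and the toy corpus are made up for illustration, and SV's own weighting may differ in log base or smoothing.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Return {term: log(N / df)} for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain the term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


# Hypothetical toy corpus, already tokenized.
docs = [
    ["bliss", "path", "mind"],
    ["mind", "path"],
    ["mahamudra", "mind"],
]

weights = idf_weights(docs)
# Terms that occur in fewer documents get larger weights.
for term in sorted(weights, key=weights.get, reverse=True):
    print(f"{term}\t{weights[term]:.3f}")
```

A term that occurs in every document ("mind" here) gets weight zero, which is the sense in which IDF measures how well a term differentiates documents.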

Applying the sample command from the SV site: https://code.google.com/p/semanticvectors/wiki/InstallationInstructions/#To Build and Search a Model:

C:\TOOLS\semantics\SV\trunk\target>java pitt.search.semanticvectors.LSA -termweight idf positional_index

Exception in thread "main" java.lang.IllegalArgumentException: -luceneindexpath must be set.
        at pitt.search.semanticvectors.LSA.main(LSA.java:240)

So, I added this flag: -luceneindexpath C:\TOOLS\semantics\SV\indexDir

C:\TOOLS\semantics\SV\trunk\target>java pitt.search.semanticvectors.LSA -termweight idf positional_index -luceneindexpath C:\TOOLS\semantics\SV\indexDir

but I get the same error again:

Exception in thread "main" java.lang.IllegalArgumentException: -luceneindexpath must be set.
        at pitt.search.semanticvectors.LSA.main(LSA.java:240)

This is somewhat confusing, particularly in light of the following statement on the same page: "Any term weighting for documents is computed when the document vectors are created, as part of the index building process [which was already done. -Al]. So giving a -luceneindexpath argument when using documents as queries will not help you at all, and can cause SemanticVectors to discard your query terms (since, for example, /files/file1.txt isn't a term that the Lucene index recognizes)."

So how do I run the "Using Documents as Queries" example described at https://code.google.com/p/semanticvectors/wiki/DocumentSearch with term weighting switched on using -termweight idf?

Thank you.

Al

Dominic Widdows

Jul 5, 2015, 9:19:28 PM
to semanti...@googlegroups.com
Sorry for the delayed response - holiday, family birthday, and I will be travelling for the next couple of days.

For your immediate error message, try removing positional_index from your command (quoted below). I suspect that the problem arises from the program trying to parse this argument before even reaching -luceneindexpath.

C:\TOOLS\semantics\SV\trunk\target>java pitt.search.semanticvectors.LSA -termweight idf positional_index -luceneindexpath C:\TOOLS\semantics\SV\indexDir
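To illustrate the suspected cause, here is a toy flag parser in Python that stops reading flags at the first bare token. It is a hypothetical sketch, not SV's FlagConfig code; `parse_flags`, `KNOWN`, and the placeholder path `indexDir` are invented for the example.

```python
def parse_flags(args, known_flags):
    """Toy parser: consume '-name value' pairs until the first bare token."""
    flags = {}
    i = 0
    while i < len(args) and args[i].startswith("-"):
        name = args[i][1:]
        if name not in known_flags:
            raise ValueError(f"Command line flag not defined: {name}")
        if i + 1 >= len(args):
            raise ValueError(f"Flag -{name} needs a value")
        flags[name] = args[i + 1]
        i += 2
    # Everything from the first bare token onward is treated as non-flag input.
    return flags, args[i:]


# Hypothetical set of flags the toy parser accepts.
KNOWN = {"termweight", "luceneindexpath"}

# A bare token in the middle: parsing stops before -luceneindexpath is seen.
flags, rest = parse_flags(
    ["-termweight", "idf", "positional_index", "-luceneindexpath", "indexDir"],
    KNOWN,
)
print(flags)
print(rest)

# Without the bare token, both flags are picked up.
flags_ok, rest_ok = parse_flags(
    ["-termweight", "idf", "-luceneindexpath", "indexDir"], KNOWN
)
print(flags_ok)
```

Under this assumed behaviour, a program that checks for luceneindexpath after parsing would report it as unset in the first call even though it appears on the command line.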

For your other question - once you have document search working you should see a difference between vectors built using -termweight idf and -termweight logentropy. But you will need to use these arguments in separate BuildIndex runs and keep track of which termvectors and docvectors files were built with which options.
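As a rough picture of why the two options can give different vectors, the sketch below compares IDF with the textbook log-entropy global weight, 1 + sum_j(p_j log p_j) / log(n), where p_j is the share of a term's occurrences that fall in document j. Whether SV's logentropy option uses exactly this form is an assumption; the function names and counts are invented.

```python
import math


def idf_weight(counts):
    """IDF from per-document counts of one term: log(n_docs / docs containing it)."""
    n_docs = len(counts)
    df = sum(1 for c in counts if c > 0)
    return math.log(n_docs / df)


def entropy_weight(counts):
    """Textbook log-entropy global weight: 1 + sum_j(p_j * log p_j) / log(n_docs)."""
    n_docs = len(counts)
    total = sum(counts)
    if n_docs < 2 or total == 0:
        raise ValueError("needs at least two documents and one occurrence")
    entropy_sum = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            entropy_sum += p * math.log(p)
    return 1.0 + entropy_sum / math.log(n_docs)


# Per-document counts of three hypothetical terms in a four-document corpus.
examples = {
    "even_in_two_docs": [10, 10, 0, 0],
    "skewed_in_two_docs": [19, 1, 0, 0],
    "spread_everywhere": [5, 5, 5, 5],
}
for name, counts in examples.items():
    print(f"{name}\tidf={idf_weight(counts):.3f}\tentropy={entropy_weight(counts):.3f}")
```

IDF only looks at how many documents contain a term, so the first two terms get the same IDF, while the entropy weight also reacts to how unevenly the occurrences are distributed.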

Best wishes,
Dominic

Al

Jul 5, 2015, 11:16:02 PM
to semanti...@googlegroups.com
It's not a delayed reply but rather a prompt one. Thank you!
And Happy Birthday!
Will try this search again as advised.
Have a safe trip.

Al

Al

Jul 8, 2015, 12:18:16 AM
to semanti...@googlegroups.com
Hi Dominic

Yesterday's excitement about the fix has been curbed by new errors.
When I try to use PATH_TO_LUCENE_INDEX as
-luceneindex path C:\TOOLS\semantics\SV\indexDir, I get:

C:\TOOLS\semantics\SV\trunk\target>java pitt.search.semanticvectors.LSA -termweight idf -luceneindex path C:\TOOLS\semantics\SV\indexDir
Initialized LuceneUtils from Lucene index in directory: C:\TOOLS\semantics\SV\indexDir
Jul 06, 2015 5:32:00 PM pitt.search.semanticvectors.LSA <init>
WARNING: Dimension for SVD cannot be greater than number of documents ... Setting dimension to 1
Set up LSA indexer.
Dimension: 1 Lucene index contents field: 'contents' Minimum frequency = 0 Maximum frequency = 2147483647 Number non-alphabet characters = 2147483647
There are 34283 terms (and 1 docs).
Starting SVD using algorithm LAS2 ...
Wrote 34283 term vectors incrementally to file termvectors.
Wrote 1 document vectors incrementally to file docvectors. Done.

But BuildIndex returns:
C:\TOOLS\semantics\SV\trunk\target>java pitt.search.semanticvectors.BuildIndex -luceneindex path C:\TOOLS\semantics\SV\indexDir -indexfileformat text

BuildIndex class in package pitt.search.semanticvectors
Usage: java pitt.search.semanticvectors.BuildIndex -luceneindexpath PATH_TO_LUCENE_INDEX
BuildIndex creates termvectors and docvectors files in local directory.
Other parameters that can be changed include number of dimensions, vector type (real, binary or complex), seed length (number of non-zero entries in basic vectors), minimum term frequency, max. number of non-alphabetical characters per term, filtering of numeric terms (i.e. numbers), and number of iterative training cycles.
To change these use the command line arguments
  -vectortype [real, complex or binary]
  -dimension [number of dimension]
  -seedlength [seed length]
  -minfrequency [minimum term frequency]
  -maxnonalphabetchars [number non-alphabet characters (-1 for any number)]
  -filternumbers [true or false]
  -trainingcycles [training cycles]
  -docindexing [incremental|inmemory|none] Switch between building doc vectors incrementally
        (requires positional index), all in memory (default case), or not at all
Exception in thread "main" java.lang.IllegalArgumentException: Command line flag not defined: luceneindex
        at pitt.search.semanticvectors.FlagConfig.getFlagConfig(FlagConfig.java:430)
        at pitt.search.semanticvectors.BuildIndex.main(BuildIndex.java:76)

I searched the SV wiki for the luceneindex command-line flags but didn't find any documentation on how to use them, though I may just have missed it.
Would you kindly point me to these instructions? I'm under deadlines and need to complete this part of my project, so I'd greatly appreciate your advice.
Thank you,

Al

Dominic Widdows

Jul 8, 2015, 12:34:59 AM
to semanti...@googlegroups.com
Hi Al,

The immediate fix is that -luceneindex path C:\TOOLS\semantics\SV\indexDir
should be -luceneindexpath C:\TOOLS\semantics\SV\indexDir
(-luceneindexpath is all one word).

I don't know why the message from the LSA command is different from the message from the BuildIndex command - I see something different:

$ java pitt.search.semanticvectors.LSA -termweight idf -luceneindex path

LSA class in package pitt.search.semanticvectors
Usage: java pitt.search.semanticvectors.LSA [other flags] -luceneindexpath PATH_TO_LUCENE_INDEXUse flags to configure dimension, min term frequency, etc. See online documentation for other available flags
Exception in thread "main" java.lang.IllegalArgumentException: Command line flag not defined: luceneindex
        at pitt.search.semanticvectors.FlagConfig.getFlagConfig(FlagConfig.java:430)
        at pitt.search.semanticvectors.LSA.main(LSA.java:229)

In general it would be good to clean up the console output and make sure that errors rise to the top / are more prominent. But for now the argument fix should help you.

Best wishes,
Dominic

Al

Jul 8, 2015, 1:15:23 AM
to semanti...@googlegroups.com
Hi, Dominic

Many thanks for the instructions - I'll apply them tonight.
Thank you.
Best,

Al

Al

Jul 9, 2015, 12:01:06 AM
to semanti...@googlegroups.com
Good evening, Dominic

Exactly as you said, correcting the spelling from -luceneindex path to -luceneindexpath fixed the search. THANK YOU!
However, I'm still unable to get output from either CompareTerms or Search with the SPARSESUM, MAXSIM, or plain SUM options.

C:\TOOLS\semantics\SV\trunk\target>java pitt.search.semanticvectors.CompareTerms -luceneindexpath C:\TOOLS\semantics\SV\indexDir mahamudra bliss
Outputting similarity of 'mahamudra' with 'bliss':
Setting dimension of target config to: 1
Opened query vector store from file: termvectors

Initialized LuceneUtils from Lucene index in directory: C:\TOOLS\semantics\SV\indexDir
Found vector for 'mahamudra'
Found vector for 'bliss'
NaN

C:\TOOLS\semantics\SV\trunk\target>java pitt.search.semanticvectors.Search -luceneindexpath C:\TOOLS\semantics\SV\indexDir -searchtype sparsesum mahamudra bliss
Search class in package pitt.search.semanticvectors
Usage: java pitt.search.semanticvectors.Search [-queryvectorfile query_vector_file]
 ...
Exception in thread "main" java.lang.IllegalArgumentException: No enum constant pitt.search.semanticvectors.Search.SearchType.SPARSESUM
Accepted values for '-searchtype' are:
[SUM, SUBSPACE, MAXSIM, MINSIM, PERMUTATION, BALANCEDPERMUTATION, BOUNDPRODUCT, LUCENE, BOUNDMINIMUM, BOUNDPRODUCTSUBSPACE, INTERSECTION, ANALOGY, PRINTQUERY, PRINTPSIQUERY, PROXIMITY]
        at pitt.search.semanticvectors.FlagConfig.getFlagConfig(FlagConfig.java:412)
        at pitt.search.semanticvectors.Search.main(Search.java:460)

C:\TOOLS\semantics\SV\trunk\target>java pitt.search.semanticvectors.Search -luceneindexpath C:\TOOLS\semantics\SV\indexDir -searchtype maxsim  mahamudra bliss
Opening query vector store from file: termvectors
Setting dimension of target config to: 1

Initialized LuceneUtils from Lucene index in directory: C:\TOOLS\semantics\SV\indexDir
Searching term vectors, searchtype MAXSIM
Found vector for 'mahamudra'
Found vector for 'bliss'
No search output.

C:\TOOLS\semantics\SV\trunk\target>java pitt.search.semanticvectors.Search -searchtype sum -luceneindexpath C:\TOOLS\semantics\SV\indexDir -searchtype sum mahamudra bliss
Opening query vector store from file: termvectors
Setting dimension of target config to: 1

Initialized LuceneUtils from Lucene index in directory: C:\TOOLS\semantics\SV\indexDir
Searching term vectors, searchtype SUM
Found vector for 'mahamudra'
Found vector for 'bliss'
No search output.
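One way a cosine similarity can come out as NaN is when both vectors have zero length. The LSA log earlier in the thread reports a single document and dimension 1, and under the common log(N / df) form of IDF every term in a one-document index gets weight zero. Whether this is what happens inside SV here is an assumption, not something confirmed in the thread; the sketch below only shows the arithmetic, with invented names and values.

```python
import math


def cosine(u, v):
    """Cosine similarity that mirrors IEEE-754 behaviour for zero-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # IEEE-754 arithmetic (as in Java) evaluates 0.0 / 0.0 to NaN. Python raises
    # ZeroDivisionError instead, so this sketch returns NaN explicitly.
    return dot / norm if norm != 0.0 else math.nan


# Assumed situation: one document, so every term has document frequency 1.
n_docs = 1
doc_freq = {"mahamudra": 1, "bliss": 1}
idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Hypothetical 1-dimensional term vectors scaled by their IDF weight.
vectors = {term: [1.0 * weight] for term, weight in idf.items()}

print(idf)
print(cosine(vectors["mahamudra"], vectors["bliss"]))
```

Separately, with dimension 1 any two non-zero vectors can only have cosine +1 or -1, so a one-document index leaves very little for a similarity search to distinguish.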

Many thanks in advance for your time and help with this issue!
Regards,

Al


Al

Jul 13, 2015, 12:46:54 AM
to semanti...@googlegroups.com
Hi, there

Sorry to repeat my question: I'm still unable to get output from either CompareTerms or Search with the SPARSESUM, MAXSIM, or plain SUM options.
I'm getting NaN:

Initialized LuceneUtils from Lucene index in directory: C:\TOOLS\semantics\SV\indexDir

Found vector for 'mahamudra'
Found vector for 'bliss'
NaN

C:\TOOLS\semantics\SV\trunk\target>java pitt.search.semanticvectors.Search -luceneindexpath C:\TOOLS\semantics\SV\indexDir -searchtype maxsim  mahamudra bliss
Opening query vector store from file: termvectors
Setting dimension of target config to: 1

Initialized LuceneUtils from Lucene index in directory: C:\TOOLS\semantics\SV\indexDir
Searching term vectors, searchtype MAXSIM
Found vector for 'mahamudra'
Found vector for 'bliss'
No search output.

Please help me fix this.
Regards,

Al


Dominic Widdows

Jul 13, 2015, 2:16:20 AM
to semanti...@googlegroups.com
Hi Al,

I may have to investigate further, but quickly to unblock you:

SPARSESUM never worked very well and isn't currently supported. Something more in the sigmoid area might work better; as it happens, I'm planning to talk to Trevor about this later today. I'll keep you in the loop but it may be a slow loop!

MAXSIM, SUBSPACE, and MINSIM all affect the way multiple words are combined into a single vector, and when you're just comparing two terms they shouldn't change anything. (To be clearer with an example, they only matter if you're comparing "red yellow" with "orange", not when comparing just "red" with "orange".) So there may be something messed up in the implementation when there is no term combination going on, because I will not have tested that case properly. I should investigate and fix this, but you shouldn't wait on it because it won't help you.
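The "red yellow" versus "orange" point can be sketched with toy vectors. The scoring rules below (summing the query vectors for SUM, taking the best single-term match for MAXSIM) are assumptions for illustration, not SV's implementation, and the 2-dimensional vectors are invented.

```python
import math


def cosine(u, v):
    """Plain cosine similarity for non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def sum_score(query_vecs, target):
    """Assumed SUM behaviour: add the query vectors, then compare once."""
    combined = [sum(components) for components in zip(*query_vecs)]
    return cosine(combined, target)


def maxsim_score(query_vecs, target):
    """Assumed MAXSIM behaviour: compare each query term, keep the best match."""
    return max(cosine(q, target) for q in query_vecs)


# Invented toy vectors.
vectors = {"red": [1.0, 0.0], "yellow": [0.0, 1.0], "orange": [0.6, 0.8]}
orange = vectors["orange"]

two_terms = [vectors["red"], vectors["yellow"]]
one_term = [vectors["red"]]

# With two query terms the combination rule matters.
print(sum_score(two_terms, orange), maxsim_score(two_terms, orange))
# With a single query term both rules reduce to the same cosine.
print(sum_score(one_term, orange), maxsim_score(one_term, orange))
```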

Best wishes,
Dominic

Al

Jul 19, 2015, 5:03:27 PM
to semanti...@googlegroups.com
Hi, Dominic

Thanks for "unblocking" me. I understand that my options are not supported for now. In the meantime, if you have an idea of how I might get the weights for the top 20 terms in my text set, I'd be grateful.
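Outside SV, one rough way to list the heaviest terms is to sum tf-idf over the corpus and keep the top N. This is a plain-Python sketch under assumptions: the documents are already tokenized, IDF is log(N / df), and `top_terms` and the sample tokens are invented rather than part of any SV API.

```python
import heapq
import math
from collections import Counter


def top_terms(docs, n=20):
    """Rank terms by tf-idf summed over all documents; return the n heaviest."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        for term, tf in Counter(doc).items():
            scores[term] += tf * math.log(n_docs / df[term])
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


# Hypothetical tokenized text set.
docs = [
    ["bliss", "bliss", "mind"],
    ["mind", "path"],
    ["mind", "mahamudra", "mahamudra", "mahamudra"],
]

for term, weight in top_terms(docs, n=20):
    print(f"{term}\t{weight:.3f}")
```

Calling `top_terms(docs, n=20)` on a real text set would return at most 20 (term, weight) pairs in descending order of weight.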
Best,

Al