Zero document vectors for default corpus

Nikola Morena

Nov 17, 2015, 11:08:12 AM
to Semantic Vectors

Hi,

I'm following the instructions at wiki/InstallationInstructions#to-build-and-search-a-model, using resources/testdata/John as a test corpus. The Lucene index and term vectors are created as expected, but I'm getting zero vectors for all documents. What am I doing wrong?

 

Seedlength: 10, Dimension: 200, Vector type: REAL, Minimum frequency: 0, Maximum frequency: 2147483647, Number non-alphabet characters: 2147483647, Contents fields are: [contents]

Initialized LuceneUtils from Lucene index in directory: D:\devResources\LSA\semanticvectors\src\test\resources\testdata\JohnIndex

Creating term vectors as superpositions of elemental document vectors ...

Initialized LuceneUtils from Lucene index in directory: D:\devResources\LSA\semanticvectors\src\test\resources\testdata\JohnIndex

Creating semantic term vectors ...

There are 1368 terms (and 21 docs).

Training term vectors for field contents

Processed 0 terms ... Processed 1000 terms ...

Created 1368 term vectors.

Writing term vectors to termvectors

About to write 1368 vectors of dimension 200 to Lucene format file: termvectors.bin ... finished writing vectors.

Writing vectors incrementally to file docvectors.bin ...

nov 17, 2015 3:25:22 PM pitt.search.semanticvectors.IncrementalDocVectors trainIncrementalDocVectors

WARNING: Outputting zero vector for document 'D:\devResources\LSA\semanticvectors\src\test\resources\testdata\John\Chapter_1'. This probably means that none of the -contentsfields were populated, or all terms failed the LuceneUtils termsfilter. You may want to investigate.

nov 17, 2015 3:25:22 PM pitt.search.semanticvectors.IncrementalDocVectors trainIncrementalDocVectors (… the same for all chapters)

Dominic Widdows

Nov 17, 2015, 6:35:24 PM
to semanti...@googlegroups.com
Hi Nikola,

As I mentioned offline, I'd like to confirm that this is a regression (that I should have spotted earlier) in the instruction to use 
java org.apache.lucene.demo.IndexFiles $PATH_TO_READ_YOUR_CORPUS

This isn't compatible with the incremental document indexing that Semantic Vectors does nowadays. Instead please use
java pitt.search.semanticvectors.IndexFilePositions $PATH_TO_READ_YOUR_CORPUS
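
To illustrate why the choice of indexer matters, here is a toy Python sketch. It is not Semantic Vectors code. It assumes that a document vector is built by summing the vectors of the terms the index records for that document, and that an index built with the Lucene demo indexer presumably exposes no such per-document terms. All names in it (make_term_vectors, doc_vector, is_zero) are hypothetical and exist only for this illustration.

```python
# Toy illustration only: this is NOT Semantic Vectors code.
import random

DIMENSION = 8  # the run in this thread used 200; kept small here


def make_term_vectors(terms, dimension=DIMENSION, seed=0):
    """Give each term a random real-valued vector (a stand-in for trained term vectors)."""
    rng = random.Random(seed)
    return {t: [rng.uniform(-1.0, 1.0) for _ in range(dimension)] for t in terms}


def doc_vector(doc_terms, term_vectors, dimension=DIMENSION):
    """Sum the vectors of the terms recorded for one document."""
    total = [0.0] * dimension
    for term in doc_terms:
        vec = term_vectors.get(term)
        if vec is None:
            continue  # unknown or filtered-out term contributes nothing
        for i, x in enumerate(vec):
            total[i] += x
    return total


def is_zero(vector):
    """True when every component is exactly zero."""
    return all(x == 0.0 for x in vector)


term_vectors = make_term_vectors(["word", "light", "life"])

# Case 1: the index records which terms occur in the document.
with_terms = doc_vector(["word", "light", "word"], term_vectors)

# Case 2 (assumed failure mode): the index exposes no per-document terms,
# so there is nothing to sum and the result stays a zero vector.
without_terms = doc_vector([], term_vectors)
```

In the second case the sum never receives a contribution, which matches the situation the warning describes: no contents field was populated for the document, or every term was filtered out.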


Please let me know if you have any further trouble.

Best wishes,
Dominic
