Zero document vectors for default corpus

Nikola Morena

Nov 17, 2015, 11:08:12 AM

to Semantic Vectors

Hi,

I'm following the instructions from wiki/InstallationInstructions#to-build-and-search-a-model, using resources/testdata/John as a test corpus. The Lucene index and the term vectors are created as expected, but every document gets a zero vector. What am I doing wrong?

Seedlength: 10, Dimension: 200, Vector type: REAL, Minimum frequency: 0, Maximum frequency: 2147483647, Number non-alphabet characters: 2147483647, Contents fields are: [contents]

Initialized LuceneUtils from Lucene index in directory: D:\devResources\LSA\semanticvectors\src\test\resources\testdata\JohnIndex

Creating term vectors as superpositions of elemental document vectors ...

Initialized LuceneUtils from Lucene index in directory: D:\devResources\LSA\semanticvectors\src\test\resources\testdata\JohnIndex

Creating semantic term vectors ...

There are 1368 terms (and 21 docs).

Training term vectors for field contents

Processed 0 terms ... Processed 1000 terms ...

Created 1368 term vectors.

Writing term vectors to termvectors

About to write 1368 vectors of dimension 200 to Lucene format file: termvectors.bin ... finished writing vectors.

Writing vectors incrementally to file docvectors.bin ...

nov 17, 2015 3:25:22 PM pitt.search.semanticvectors.IncrementalDocVectors trainIncrementalDocVectors

WARNING: Outputting zero vector for document 'D:\devResources\LSA\semanticvectors\src\test\resources\testdata\John\Chapter_1'. This probably means that none of the -contentsfields were populated, or all terms failed the LuceneUtils termsfilter. You may want to investigate.

nov 17, 2015 3:25:22 PM pitt.search.semanticvectors.IncrementalDocVectors trainIncrementalDocVectors (… the same for all chapters)

Dominic Widdows

Nov 17, 2015, 6:35:24 PM

to semanti...@googlegroups.com

Hi Nikola,

As I mentioned offline, I'd like to confirm that this is a regression (that I should have spotted earlier) in the instruction to use

java org.apache.lucene.demo.IndexFiles $PATH_TO_READ_YOUR_CORPUS

This isn't compatible with the incremental document indexing that Semantic Vectors does nowadays. Instead please use

java pitt.search.semanticvectors.IndexFilePositions $PATH_TO_READ_YOUR_CORPUS

I've updated https://github.com/semanticvectors/semanticvectors/wiki/InstallationInstructions accordingly.

Please let me know if you have any further trouble.

Best wishes,

Dominic
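
One way to read the warning above, offered as an assumption rather than a confirmed account of the Semantic Vectors internals: a document vector is built by summing the term vectors of the terms recorded for that document, so if the index exposes no per-document term list, nothing gets summed and the result stays a zero vector. The toy Java sketch below illustrates only that arithmetic. The class ZeroDocVectorSketch and its methods are hypothetical and do not call the Lucene or Semantic Vectors APIs.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the Semantic Vectors implementation.
public class ZeroDocVectorSketch {

    // Sum the term vectors of every term listed for a document.
    // Terms without a known vector are skipped.
    static float[] docVector(List<String> docTerms,
                             Map<String, float[]> termVectors,
                             int dimension) {
        float[] sum = new float[dimension];
        for (String term : docTerms) {
            float[] tv = termVectors.get(term);
            if (tv == null) {
                continue;
            }
            for (int i = 0; i < dimension; i++) {
                sum[i] += tv[i];
            }
        }
        return sum;
    }

    // True when every component is exactly zero.
    static boolean isZero(float[] v) {
        for (float x : v) {
            if (x != 0.0f) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up term vectors, for illustration only.
        Map<String, float[]> termVectors = new HashMap<>();
        termVectors.put("light", new float[] {1f, 0f, -1f});
        termVectors.put("word", new float[] {0f, 1f, 1f});

        // Case 1: the index exposes the document's terms.
        float[] withTerms = docVector(Arrays.asList("light", "word"), termVectors, 3);

        // Case 2: no per-document terms are available (assumed to be the
        // situation behind the warning), so nothing is added to the sum.
        float[] withoutTerms = docVector(Collections.<String>emptyList(), termVectors, 3);

        System.out.println("with terms, zero vector?    " + isZero(withTerms));
        System.out.println("without terms, zero vector? " + isZero(withoutTerms));
    }
}
```

In this sketch the term vectors themselves are perfectly fine in both cases, which matches the log above: 1368 term vectors are written successfully, and only the document vectors come out as zero.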
