SEVERE: No term vector for document (for almost all documents) while building model


Kshitij Dholakia

Nov 26, 2014, 01:19:32
to semanti...@googlegroups.com
Hello,

I'm attempting to build a model from a Lucene index (4.10.2) using semanticvectors-5.5.jar (which I rebuilt using the diff.patch Dominic provided in an older post). The Lucene index contains around 4.5 million documents (StackOverflow posts). I've indexed the Body and Title fields and also stored the term vectors (termVectors="true") in the schema.xml file.

The program processes all the terms (close to 9 million; my minfrequency param = 0), but when it reaches the training stage (I guess), it prints "SEVERE: No term vector for document <document number>" for almost all the documents. This is confusing, because I've included all terms irrespective of frequency.

I verified from the Solr console that the term vectors are stored (I'm not sure whether that's relevant, though).

Thanks a lot!  

Dominic Widdows

Nov 26, 2014, 12:53:40
to semanti...@googlegroups.com
Hi there,

It looks like you're at line 250 of TermTermVectorsFromLucene - are you using BuildPositionalIndex?

If so, have you set the -contentsfields flag, e.g., -contentsfields title,contents,body,whatever ... ? If this flag is not set consistently with the fields in your Lucene index, that could explain the sparse behavior you're seeing.
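For instance, a sketch of such an invocation (the index path and field names below are placeholders, not taken from this thread):

```shell
# Field names must match the stored fields in your Lucene/Solr schema
# (they are case-sensitive, so "Body" and "body" are different fields).
java pitt.search.semanticvectors.BuildPositionalIndex \
  -luceneindexpath /path/to/lucene/index \
  -contentsfields Title,Body
```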

If this doesn't fix the issue, then when you write back, please include a copy and paste of the whole console output, from the moment you issue the "java pitt.search.semanticvectors.... " command through to the end of the process. (Hopefully we can solve this without going back through your whole Solr console process, which I'm not familiar with.)

Best wishes,
Dominic



--
You received this message because you are subscribed to the Google Groups "Semantic Vectors" group.
To unsubscribe from this group and stop receiving emails from it, send an email to semanticvecto...@googlegroups.com.
To post to this group, send email to semanti...@googlegroups.com.
Visit this group at http://groups.google.com/group/semanticvectors.
For more options, visit https://groups.google.com/d/optout.

This message has been deleted.
This message has been deleted.

Kshitij Dholakia

Nov 26, 2014, 23:00:02
to semanti...@googlegroups.com
I used the following command: "java -Xmx15g pitt.search.semanticvectors.BuildIndex -docindexing incremental -luceneindexpath /home/ubuntu/data/index/"

Here is the entire sample output: https://s3-us-west-2.amazonaws.com/data-sci-stack-overflow/output.txt

(I think the SEVERE message is being printed by IncrementalDocVectors.java)

I did not use the -contentsfields argument; instead I did something silly: I modified FlagConfig.java and added "Body" and "Title" to private String[] contentsfields.

Also, I'm using Solr's filters during indexing to tokenize, stem, and remove stop words from Body and Title. I think BuildIndex attempts to do something similar? Do you think that could be related?

Thanks a lot!

Dominic Widdows

Nov 28, 2014, 16:55:38
to semanti...@googlegroups.com
Hi there,

Thanks for the transcript. It hadn't occurred to me that the console output would get so big; I can see why you didn't just cut and paste it in!

The immediate thing I've done is check in a change that logs a lower-level message when a field is empty, and issues a warning only when every contentsfield in a document is empty. See https://code.google.com/p/semanticvectors/source/detail?r=1166

So here's what I suspect is happening:
i. Term vector indexing / learning has already completed, so you should have a termvectors.bin file that you can use for term-term analyses.
ii. The document vector warning messages are only happening for some documents, not all.
iii. The code was issuing the SEVERE messages every time any contentsfield was empty. I think this was a mistake in the code. Now, if you recompile, you should get only a FINE-level message when a single field is empty, and a SEVERE message only when the whole document is empty.
iv. In addition, these messages should now have the user-provided docid (typically the path / filename), rather than Lucene's internal doc integer ID.
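If it helps to picture the intended behavior, here is a minimal self-contained sketch (not the actual SemanticVectors source; the class, field names, and documents below are invented) of logging per-field emptiness at FINE level and reserving SEVERE for the case where every contentsfield is empty:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

public class FieldEmptinessCheck {
    private static final Logger logger = Logger.getLogger("demo");

    // Returns true only when every configured contents field is empty for this doc.
    static boolean allFieldsEmpty(Map<String, String> doc, List<String> contentsFields) {
        int emptyCount = 0;
        for (String field : contentsFields) {
            String value = doc.get(field);
            if (value == null || value.isEmpty()) {
                logger.fine("Field '" + field + "' is empty.");  // low-level message per field
                emptyCount++;
            }
        }
        return emptyCount == contentsFields.size();
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("Title", "Body");
        Map<String, String> titleOnly = Map.of("Title", "Hello");
        Map<String, String> neither = Map.of();
        if (allFieldsEmpty(neither, fields)) {
            logger.severe("Document vector is zero: no contentsfields populated.");
        }
        System.out.println(allFieldsEmpty(titleOnly, fields)); // false: Title is populated
        System.out.println(allFieldsEmpty(neither, fields));   // true: nothing populated
    }
}
```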

If I'm correct / lucky, you should be able to check out the latest code, recompile, and rerun without seeing these problems. If you want to confirm the hypothesis of what's happening, you should be able to check the finer error messages and see if they correspond with documents where you have (say) a title and no body, or something like that.

I have not written new test cases for this; that would take a bit longer, because I'd need to hack up an example Lucene index. I would like to create such tests in due course, but in the interests of time I wanted to share what I think is a reasonable diagnosis and fix so that you can see if it works for you.

Best wishes,
Dominic

Kshitij Dholakia

Nov 29, 2014, 15:46:56
to semanti...@googlegroups.com
Hey, I got a MissingFormatArgumentException: Format specifier 's' (line 171 in IncrementalDocVectors.java).

Here's the entire message: 

Created 9807314 term vectors.
About to write 9807314 vectors of dimension 200 to Lucene format file: termvectors.bin ... finished writing vectors.
Writing vectors incrementally to file docvectors.bin ... Exception in thread "main" java.util.MissingFormatArgumentException: Format specifier 's'
        at java.util.Formatter.format(Formatter.java:2487)
        at java.util.Formatter.format(Formatter.java:2423)
        at java.lang.String.format(String.java:2790)
        at pitt.search.semanticvectors.IncrementalDocVectors.trainIncrementalDocVectors(IncrementalDocVectors.java:171)
        at pitt.search.semanticvectors.IncrementalDocVectors.createIncrementalDocVectors(IncrementalDocVectors.java:88)
        at pitt.search.semanticvectors.BuildIndex.main(BuildIndex.java:116)
ubuntu@ip-172-31-1-39:~/data$ 


It corresponds to this line: 

"Document vector is zero for document '%s'. This probably means that none of " +
                "the -contentsfields were populated. this is a bad sign and should be investigated."));


 Is there a ", dc" argument needed over there?
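For reference, a minimal stdlib reproduction of that failure mode (the docid value below is made up for illustration):

```java
import java.util.MissingFormatArgumentException;

public class FormatSpecifierDemo {
    public static void main(String[] args) {
        boolean threw = false;
        try {
            // Bug: the format string contains %s but no argument is supplied.
            String.format("Document vector is zero for document '%s'.");
        } catch (MissingFormatArgumentException e) {
            threw = true;
        }
        System.out.println(threw); // true

        // Fix: supply the document id as the argument for %s.
        String docid = "doc_42"; // illustrative value
        System.out.println(String.format("Document vector is zero for document '%s'.", docid));
    }
}
```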

Thanks!

Dominic Widdows

Nov 30, 2014, 10:59:52
to semanti...@googlegroups.com
Ouch, you're right, thanks for the quick feedback. I've just checked in a fix (https://code.google.com/p/semanticvectors/source/detail?r=1170).

Note to self: must test this with a small artificial Lucene index.

Hope that helps. The fact that you're reaching this line at all, though, indicates that some documents really do have all of their contentsfields absent.

Best wishes,
Dominic

Kshitij Dholakia

Dec 1, 2014, 15:09:10
to semanti...@googlegroups.com
It turned out that the script that was indexing the documents was somehow dropping the fields, which led to the SEVERE messages. After re-indexing, I got the new debug message (from the fix) for just one document. I'll move on to the analysis/document-search part for now.

I've built the model on a pretty huge corpus (~4.5 million documents), so let me know if you want info on any specific results/tests.

Thanks a lot!

Dominic Widdows

Dec 1, 2014, 19:28:28
to semanti...@googlegroups.com
Good news, then - I'm glad this fix worked for your needs so far. Hope you start to get interesting results soon, and yes please, feel free to share anything that you find interesting or confusing.

Best wishes,
Dominic