Question regarding LSA and the "-1 terms found" error.


Brian Ballsun-Stanton

unread,
Oct 18, 2013, 11:38:48 PM
to semanti...@googlegroups.com
While setting up an LSA index for one of my students, I ran into the following error:

"There are -1 terms (and 480 docs)."

 java pitt.search.semanticvectors.LSA -termweight idf -luceneindexpath positional_index/
Set up LSA indexer.
Dimension: 200 Minimum frequency = 0 Maximum frequency = 2147483647 Number non-alphabet characters = 2147483647
There are -1 terms (and 480 docs).
Exception in thread "main" java.lang.NegativeArraySizeException
        at pitt.search.semanticvectors.LSA.smatFromIndex(LSA.java:96)
        at pitt.search.semanticvectors.LSA.main(LSA.java:240)

The corpus is smallish, with 480 docs (as shown above), and BuildPositionalIndex ran (mostly) fine:

java pitt.search.semanticvectors.BuildPositionalIndex -windowradius 2 -luceneindexpath positional_index/
Building positional index, Lucene index: positional_index/, Seedlength: 10, Vector length: 200, Vector type: REAL, Minimum term frequency: 0, Maximum term frequency: 2147483647, Number non-alphabet characters: 2147483647, Window radius: 2, Fields to index: [contents]
Created basic term vectors for 128740 terms (and 480 docs).
Processed 0 documents ... Created 128740 term vectors ...
Normalizing term vectors.
About to write 128740 vectors of dimension 200 to Lucene format file: termtermvectors.bin ... finished writing vectors.
Writing vectors incrementally to file docvectors.bin ... Oct 19, 2013 3:35:12 AM pitt.search.semanticvectors.IncrementalDocVectors trainIncrementalDocVectors
SEVERE: No term vector for document 246
Finished writing vectors.

So the corpus exists and is readable, at least in the general sense that queries against the positional index are possible (though I'm not sure about the quality of their results).

I'm using 4.0 from source and Lucene 4.5. (I tried to check out the latest and ran into some library problems.) Is there a more correct invocation to build an LSA index? More broadly, is this LSA index the best basis for synonym generation?


Dominic

unread,
Oct 19, 2013, 11:57:28 AM
to semanti...@googlegroups.com
This is very confusing. I admit I do not know how this would happen. Can you try setting the -contentsfield explicitly? This is one way in which LSA and BuildPositionalIndex differ (though it shouldn't cause this behavior).
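For example, something like this (assuming the default field name "contents", which is what the BuildPositionalIndex output above lists under "Fields to index"):

    java pitt.search.semanticvectors.LSA -contentsfield contents -luceneindexpath positional_index/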

What library troubles did you have checking out from source? It might be that you need to put a parallelcolt jar in your classpath (http://sourceforge.net/projects/parallelcolt/) - this may become a requirement with future versions for real vector binding.

Best wishes,
Dominic

Pierluca Sangiorgi

unread,
Oct 21, 2013, 10:09:00 AM
to semanti...@googlegroups.com
Same problem using Solr 4.5 indexes and SV 4:

There are -1 terms (and 56 docs).
java.lang.NegativeArraySizeException
at pitt.search.semanticvectors.LSA.smatFromIndex(LSA.java:96)
at pitt.search.semanticvectors.LSA.main(LSA.java:240)

The contentsfield is defined explicitly (it's needed for Solr).
Adding parallelcolt doesn't solve it for me.

Pierluca Sangiorgi

unread,
Oct 21, 2013, 10:39:25 AM
to semanti...@googlegroups.com
Since I don't have this problem with 3.8 or with BuildIndex, I took a look at the source...
In SV 3.8, terms are read from the index using Lucene's IndexReader class, while in 4.0 they are read via LuceneUtils.getTermsForField, which is not used by BuildIndex.

I think the problem is in that method - is that possible?

Dominic Widdows

unread,
Oct 21, 2013, 1:12:51 PM
to semanti...@googlegroups.com
There's certainly something happening to make LuceneUtils.getTermsForField return something with a negative size (with SV 4.0 and the current svn source). This method simply returns atomicReader.terms(field), which is instantiated using:

    this.compositeReader = DirectoryReader.open(FSDirectory.open(new File(flagConfig.luceneindexpath())));  
    this.atomicReader =  SlowCompositeReaderWrapper.wrap(compositeReader);

Essentially, as I understand it, Lucene 4.0 did a lot to enhance distributed and parallel features, and this means that for standalone "atomic" index readers, you have to go to greater lengths to say explicitly that you want to iterate over the terms.

But I still don't know why this method would return something with size -1; its public documentation says that it would at worst return null.

I can cause such a null return value by setting (say) "-contentsfields foo" on an index that doesn't have a field "foo", but I can't reproduce a "-1" size yet. I should have a bit more time to investigate later.
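For what it's worth, the Lucene 4.x javadoc for Terms.size() does allow a return value of -1 when the codec doesn't store the term count, which would explain a "-1 terms" message even without a null. Here is a minimal defensive sketch (my own illustration, not the SV code; TermCounter/countTerms are hypothetical names):

    import java.io.IOException;
    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;

    public class TermCounter {
      /** Counts the terms for a field without trusting Terms.size(). */
      public static long countTerms(AtomicReader reader, String field) throws IOException {
        Terms terms = reader.terms(field);
        if (terms == null) {
          return 0;  // field not present, e.g. "-contentsfields foo"
        }
        long size = terms.size();  // Lucene 4.x: may legitimately be -1
        if (size >= 0) {
          return size;
        }
        // Fall back to counting by iterating the TermsEnum.
        long count = 0;
        TermsEnum termsEnum = terms.iterator(null);
        while (termsEnum.next() != null) {
          count++;
        }
        return count;
      }
    }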

Best wishes,
Dominic 



Pierluca Sangiorgi

unread,
Oct 21, 2013, 6:11:02 PM
to semanti...@googlegroups.com
This is probably a stupid question...
Why does LSA use a different method to get the terms from the Lucene index?
The getTermsForField method is called before the SVD process, and I suppose it "simply" gets the terms from the index and puts them in an array. Isn't it possible to get them exactly as random indexing does, or something like that? The other indexing types surely manage the terms differently, but they certainly retrieve them from the same source.

Dominic Widdows

unread,
Oct 21, 2013, 6:28:22 PM
to semanti...@googlegroups.com
Perfectly reasonable question.

LuceneUtils.getTermsForField is used in several classes - LSA, PSI, TermVectorsFromLucene, etc. It's meant to be pretty consistent. Why have a distinct method for just a single line of code? Well, updating to Lucene 4.x was pretty in-depth, and I'm still not sure if our combination of an atomic reader and a composite reader for traversing an index is workable in all cases. So we decided to put as much of this functionality as possible in one place, LuceneUtils, to make it easier to maintain if we have to change it.
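To make that concrete, the shared method amounts to little more than this - a sketch based on the description earlier in the thread, not the exact SV source:

    import java.io.IOException;
    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.Terms;

    public class LuceneUtilsSketch {
      private final AtomicReader atomicReader;

      public LuceneUtilsSketch(AtomicReader atomicReader) {
        this.atomicReader = atomicReader;
      }

      // One shared entry point for term iteration: if Lucene's reader
      // APIs change again, only this class has to absorb the change.
      public Terms getTermsForField(String field) throws IOException {
        return atomicReader.terms(field);
      }
    }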

Best wishes,
Dominic

Trevor Cohen

unread,
Oct 21, 2013, 6:50:20 PM
to semanti...@googlegroups.com
I've experienced "-1" errors with Luke (https://code.google.com/p/luke/) when trying to open a newer Lucene index with an older version of Luke. The problem doesn't occur with SV 3.8, which uses Lucene 3, so it's possible that this index simply isn't accessible with Lucene 4, at least not in the way we're using it.

- Trevor  

Brian Ballsun-Stanton

unread,
Oct 22, 2013, 3:19:11 AM
to semanti...@googlegroups.com
Well, 4.1 didn't build without parallelcolt. Unfortunately, 4.1 still has the -1 error. Oh well. Trying 3.8 now. 


On Tue, Oct 22, 2013 at 6:09 PM, Brian Ballsun-Stanton <br...@fedarch.org> wrote:
I can attest that it was indeed parallelcolt that was missing. Thanks!




Pierluca Sangiorgi

unread,
Oct 22, 2013, 10:39:29 AM
to semanti...@googlegroups.com
I used 3.8 in the past and it works, but with Solr/Lucene 3.6.



Dominic

unread,
Oct 22, 2013, 10:42:12 AM
to semanti...@googlegroups.com
Hi Brian,

From Trevor's suggestion, it sounds like your problem might have to do with incompatible Lucene versions rather than SV compatibility, and the problem has been seen with SV 3.8 as well as 4. So if you still have trouble, it might be necessary to rebuild your Lucene indexes.

Best wishes,
Dominic

Pierluca Sangiorgi

unread,
Oct 27, 2013, 3:28:03 PM
to semanti...@googlegroups.com
Hi Dominic, have you investigated this problem? Any news?

Thanks and bye,
Pierluca

Dominic

unread,
Oct 28, 2013, 1:17:29 PM
to semanti...@googlegroups.com
The problem comes from incompatible Lucene versions. I've reproduced the problem as follows:

- Set classpaths to Lucene 3.6.2.
- Build a Lucene index using Lucene 3.6.2. (java org.apache.lucene.demo.IndexFiles -index lucene_3_6_2_index -docs bible_chapters/)
- Set classpaths to Lucene 4.3.1.
- Try to run LSA with SemanticVectors 4.2.
- The problem occurs. 

So, the initial suggestion is to rebuild any Lucene 3.x indexes with an appropriate Lucene 4.x version.

It's possible that we could investigate further into *why* this happens with LSA and not with the other BuildIndex methods, but given that Lucene is reasonably firm in what it does and doesn't support in terms of backward compatibility, I doubt this would be a good use of time.
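(Side note, an assumption on my part rather than anything built into SV: if you're unsure which Lucene version wrote an index, Lucene's bundled CheckIndex tool prints per-segment version information, e.g.

    java -cp lucene-core-4.3.1.jar org.apache.lucene.index.CheckIndex path/to/index

with the jar name adjusted to whatever core jar you have on your classpath.)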

Best wishes,
Dominic

Pierluca Sangiorgi

unread,
Oct 28, 2013, 2:36:49 PM
to semanti...@googlegroups.com

I'm using Solr 4.5 (which I assume uses the Lucene 4.5 index format) and I have the same problem with SV 4.0.
Any suggestions, or is there no way to use LSA with Solr 4.5?


Dominic Widdows

unread,
Oct 28, 2013, 3:02:11 PM
to semanti...@googlegroups.com
Using Lucene 4.5.0 and SV 4.0, I was just able to build a Lucene index and run LSA against it. 

Can you try rebuilding your Lucene indexes and make sure they're built with Lucene / Solr 4.5.1? (I'm presuming that Lucene and Solr version numbers are in lock-step with each other.)

Best wishes,
Dominic



Pierluca Sangiorgi

unread,
Oct 30, 2013, 7:53:52 AM
to semanti...@googlegroups.com
Hi Dominic, 
I rebuilt my index with Lucene 4.5 using the IndexFiles class from SV 4.5, and LSA works without the "-1 terms" error.
The problem occurs when I try to create an LSA space over my Solr 4.5 index.

I don't know if it's important, but I've noticed that my Solr 4.5 index files are different from my old Solr 3.6 index files, and from Lucene 4.5 index files built over the same documents.
In Solr 4.5 I have many more files, with different extensions (.fdt, .tvd, .pos, .tim, .doc, etc.) instead of (.fdt, .tvf, .prx, .frq, .tis, etc.).
Could this be the problem?

thanks

Martin Voigt

unread,
Jun 30, 2014, 5:48:19 AM
to semanti...@googlegroups.com
Hi all,

I just tried to get LSA running with our current Solr installation (4.8). To that end, I built SV from source against the current Lucene version, and I'm able to run the usual "BuildIndex" without problems. But with LSA, I still get the "There are -1 terms" error, leading to an exception. Having a look at the code, I couldn't find a difference from BuildIndex. Any ideas?

Thanks,
Martin

Dominic

unread,
Jul 4, 2014, 3:33:20 PM
to semanti...@googlegroups.com
Hi Martin,

I'm sorry to be so slow to respond on this; I'm travelling at the moment.

I suspect it may be something to do with the contents field that the LSA indexer is expecting. Unlike BuildIndex, there can only be one. But ... if you're using the same command and not setting this, I don't see why the default would not work. I've just committed a change to log the contentsField in the output; if you update your svn enlistment you should get this.

Please check that you're using the -contentsfield flag in the same way with both commands, and feel free to write back to me with a copy of the full console output from the failed indexing command if you have trouble.
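For example, something like the following sketch (hypothetical index path, and using the flag spelling from this thread; the point is just that the value matches across both commands):

    java pitt.search.semanticvectors.BuildIndex -contentsfield contents -luceneindexpath solr_index/
    java pitt.search.semanticvectors.LSA -contentsfield contents -luceneindexpath solr_index/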

Best wishes,
Dominic

irish...@gmail.com

unread,
Oct 16, 2014, 7:43:03 AM
to semanti...@googlegroups.com
Dear Dominic,

I have exactly the same problem with SV 5.4. I am doing the following, using only the semanticvectors-5.4.jar:
- java -Xms1G -Xmx1G -cp  /semantik/semanticvectors-5.4.jar org.apache.lucene.demo.IndexFiles -docs docpath -index luceneIndex
- java -Xms1G -Xmx1G -cp  /semantik/semanticvectors-5.4.jar pitt.search.semanticvectors.LSA -luceneindexpath luceneIndex

I tried it under Windows and Linux, without success.

Could you please let me know under which conditions (version of lucene and SV and also operating system) you were able to make it work?

Thanks a lot,
Leo

Trevor Cohen

unread,
Oct 16, 2014, 12:00:31 PM
to semanti...@googlegroups.com
Hi Leo,
This sequence works for me with semanticvectors-5.4.jar on the test corpus that is distributed with the package (on a Mac, as it happens, but I wouldn't expect different results on Linux). So perhaps it has something to do with the nature of the corpus, or the way this is divided.
Regards,
Trevor

java -cp ../semanticvectors-5.4.jar org.apache.lucene.demo.IndexFiles -docs bible_chapters -index luceneindex

java -Xmx1G -cp ../semanticvectors-5.4.jar pitt.search.semanticvectors.LSA -luceneindexpath luceneIndex
Set up LSA indexer.
Dimension: 200 Minimum frequency = 0 Maximum frequency = 2147483647 Number non-alphabet characters = 2147483647
There are 12785 terms (and 1190 docs).
Starting SVD using algorithm LAS2 ...
Wrote 12785 term vectors incrementally to file termvectors.
Wrote 1190 document vectors incrementally to file docvectors. Done.





Michael Sperling

unread,
Dec 9, 2014, 5:52:22 PM
to semanti...@googlegroups.com
Was this problem ever resolved? I can successfully build the demo bible document vectors using LSA. However, when I try on a larger corpus (319,000 Enron emails), I get the same error:

Set up LSA indexer.
Dimension: 200 Minimum frequency = 0 Maximum frequency = 2147483647 Number non-alphabet characters = 2147483647
There are -1 terms (and 319325 docs).
Exception in thread "main" java.lang.NegativeArraySizeException
        at pitt.search.semanticvectors.LSA.smatFromIndex(LSA.java:97)
        at pitt.search.semanticvectors.LSA.main(LSA.java:241)

I am using Semantic Vectors 5.4 and I've tried building the Lucene index with versions 4.5 and 4.6 with identical results. If I use the default random projection algorithm, the vectors build successfully.

Michael Sperling

Dominic Widdows

unread,
Dec 10, 2014, 7:59:54 PM
to semanti...@googlegroups.com
Hi Michael,

The problem was never confirmed resolved, and I don't have a repro at the moment. However, I did take another look at the code throwing the exception, and it relied on Lucene Terms.size() calls in ways that TermVectorsFromLucene (used by BuildIndex) does not. So I have tried replacing the LSA implementation with the same pattern as TermVectorsFromLucene.
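Roughly, the safer pattern is to collect terms into a growable list while iterating, instead of pre-allocating an array from the reported size - a sketch of the idea (my own illustration under that assumption, not the committed SV code):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;

    public class TermCollector {
      /** Gathers all terms for a field; never calls Terms.size(), so an
          "unknown" size of -1 can't trigger a NegativeArraySizeException. */
      public static List<String> collectTerms(AtomicReader reader, String field) throws IOException {
        List<String> result = new ArrayList<String>();
        Terms terms = reader.terms(field);
        if (terms == null) {
          return result;  // field not present in this index
        }
        TermsEnum termsEnum = terms.iterator(null);
        for (BytesRef term = termsEnum.next(); term != null; term = termsEnum.next()) {
          result.add(term.utf8ToString());
        }
        return result;
      }
    }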

The attempted fix is checked in (https://code.google.com/p/semanticvectors/source/detail?r=1177), so please let me know if you can build from source and if so, if this fixes the problem.

On the other hand, if the problem remains, please could you send me a list of the commands you used to download and index the Enron corpus? I'll try to repro the problem if you do.

Best wishes,
Dominic

Michael Sperling

unread,
Dec 12, 2014, 9:54:57 AM
to semanti...@googlegroups.com
I will let you know.
Thanks,
Mike

Michael Sperling

unread,
Dec 15, 2014, 1:15:48 PM
to semanti...@googlegroups.com
Dominic,
It worked! I was able to build from source and run my Enron corpus to completion using LSA. I've been comparing results against random projection - random projection actually seems to work better. Also, just to let you know, there is a difference between RP and LSA in how the output termvectors and docvectors files are specified. In RP, you can specify an absolute path for the output files; in LSA, you need to specify a path that appears to be relative to the parent of the Lucene index.
Mike


Emre

unread,
Jan 4, 2015, 11:49:35 AM
to semanti...@googlegroups.com
Hi all,

I have the same problem. I can successfully build the demo bible document vectors using LSA, but with a large corpus I get the "There are -1 terms" error. I'm using the semantic vectors package from the command prompt (i.e. I'm not compiling the source code). So what can I do?


Dominic Widdows

unread,
Jan 5, 2015, 9:43:06 PM
to semanti...@googlegroups.com
Hello there,

I've just started pushing a new release (5.6) to the Maven Central repository. It takes a while to go through staging; I'll update the main list when it's available.

Best wishes,
Dominic