Using SV with Solr


David

Dec 2, 2008, 3:19:11 PM
to Semantic Vectors
I am attempting to use SV with Solr's index. I am very new to both
programs. When I attempt

java pitt.search.semanticvectors.BuildIndex solr/data/index

on the demo solr index, I get the following error:

seedLength = 20
Vector length = 200
Minimum frequency = 10
Populating basic sparse doc vector store, number of vectors: 26
Creating store of sparse vectors ...
Created 26 sparse random vectors.
Creating term vectors ...
There are 1137 terms (and 26 docs)
0 ... 1000 ...
Created 0 term vectors ...
Initializing document vector store ...
Building document vectors ...

Normalizing doc vectors ...
Exception in thread "main" java.lang.NullPointerException
at pitt.search.semanticvectors.DocVectors.makeWriteableVectorStore
(DocVectors.java:143)
at pitt.search.semanticvectors.BuildIndex.main(BuildIndex.java:231)


I would appreciate any help or insights you might be able to offer

Dominic

Dec 2, 2008, 4:35:48 PM
to Semantic Vectors
It looks like you're not managing to create any term vectors, and then
when you try to create document vectors from these you're running
into null pointer problems.

My guess (knowing nothing about solr) is that this is a Lucene field
problem. SemanticVectors expects to read from the "contents" field of
the Lucene index, and we've never made this properly configurable. You
can easily change this by setting "fieldsToIndex" to something
different in BuildIndex.java, line 198 - this is grubby but should fix
the problem, if this is it.

So, basic question for now is "what field does solr use for its main
content vectors?"

Best wishes,
Dominic

David

Dec 3, 2008, 1:09:48 PM
to Semantic Vectors
Thank you for your swift response.

Forgive me for not completely understanding what "main content vectors"
refers to. I have several documents added to solr's index. Every
document has a field called id. If I attempt to index after changing
line 198 to "id", I get the same error as before:

seedLength = 20
Vector length = 200
Minimum frequency = 10
Populating basic sparse doc vector store, number of vectors: 26
Creating store of sparse vectors ...
Created 26 sparse random vectors.
Creating term vectors ...
There are 1137 terms (and 26 docs)
0 ... 1000 ...
Created 0 term vectors ...
Initializing document vector store ...
Building document vectors ...

Normalizing doc vectors ...
Exception in thread "main" java.lang.NullPointerException
at pitt.search.semanticvectors.DocVectors.makeWriteableVectorStore
(DocVectors.java:143)
at pitt.search.semanticvectors.BuildIndex.main(BuildIndex.java:231)



If I instead change line 198 to read "text", a field that NOT all
documents in my solr index contain, I get the following error:

seedLength = 20
Vector length = 200
Minimum frequency = 10
Populating basic sparse doc vector store, number of vectors: 26
Creating store of sparse vectors ...
Created 26 sparse random vectors.
Creating term vectors ...
There are 1137 terms (and 26 docs)
0 ... 1000 ...
Created 3 term vectors ...
Initializing document vector store ...
Building document vectors ...
0 ...
Normalizing doc vectors ...
Exception in thread "main" java.lang.NullPointerException
at pitt.search.semanticvectors.DocVectors.makeWriteableVectorStore
(DocVectors.java:143)
at pitt.search.semanticvectors.BuildIndex.main(BuildIndex.java:231)



Perhaps you can explain this result to me. What should the
fieldsToIndex refer to if not the fields I'd like to search over?

Dominic

Dec 4, 2008, 2:07:06 PM
to Semantic Vectors, nors...@gmail.com
Hi David,

It is certainly true that fieldsToIndex should refer to the fields you
want to search over.

SemanticVectors has a term filter built into the TermVectorsFromLucene
class that filters out terms containing non-alphabetic characters:
this would explain why your IDs aren't generating terms. For the
"text" field or other fields, I'm not sure how to proceed without
being more familiar with the solr data model. Do you know what fields
solr uses in practice and how many there are?

You could easily hack TermVectorsFromLucene.java to index all fields.
If you want to index all terms (regardless of characters and fields),
you could comment out the check around line 157, which is where
termFilter() is called. Or you could hack the termFilter() method
itself around line 200 to dispense with the field filter.
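
A minimal sketch of the letters-only check described above, assuming the filter simply rejects any term that contains a non-letter character. The class and method names are illustrative and are not part of SemanticVectors, and the sample id string is made up.

```java
final class LetterFilterSketch {
    /** Returns true only if every character of the term is a letter. */
    static boolean isAllLetters(String termText) {
        for (int i = 0; i < termText.length(); ++i) {
            if (!Character.isLetter(termText.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "id-001" is a made-up identifier; the others are sample terms.
        String[] samples = {"usb", "headphon", "32", "id-001"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + isAllLetters(sample));
        }
    }
}
```

Under a check like this, numeric or punctuated values (which is what id fields usually hold) would never become terms, regardless of which field is selected.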

I'm cc'ing Lance explicitly to see if he has any other suggestions
about what fields to use from solr.

Best wishes.
Dominic

Lance Norskog

Dec 4, 2008, 3:39:50 PM
to Dominic, Semantic Vectors
In Solr you define your own fields. There is a modular system of Lucene analysers & token processors and you make stacks of these for each field for both indexing and querying. You can stem, stopword, throw in synonyms, turn European letters into British/US letters (think protege with and without accents). We make 3 indexed fields from the same raw text: exact, stemmed & stopworded, and phonetic for spelling correction. You can take several input fields and dump them all in one output field.

For a text search box you probably want stemmed & stopworded index
terms. For term searching you probably want stopworded & punctuation stripped but no stemming.  Stemming has a habit of creating a shorter term for unrelated words. This makes SV particularly weird, so you don't want stemming in your terms. You probably want everything as for a text search box except stemming.

So, to work with a Solr-controlled index you want to give the field name(s) on the command line. With multiple fields you probably want a global coefficient for the number of terms.
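
One way the field names could be taken from the command line, as a sketch only. The "-fields" flag is hypothetical and is not an existing BuildIndex option; the fallback to "contents" follows the default field mentioned earlier in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FieldArgsSketch {
    // Default taken from the thread: SemanticVectors reads the "contents" field.
    static final String[] DEFAULT_FIELDS = {"contents"};

    /** Parses a hypothetical "-fields a,b,c" argument; falls back to the default. */
    static String[] parseFields(String[] args) {
        for (int i = 0; i + 1 < args.length; ++i) {
            if (args[i].equals("-fields")) {
                List<String> fields = new ArrayList<>();
                for (String name : args[i + 1].split(",")) {
                    String trimmed = name.trim();
                    if (!trimmed.isEmpty()) {
                        fields.add(trimmed);
                    }
                }
                if (!fields.isEmpty()) {
                    return fields.toArray(new String[0]);
                }
            }
        }
        return DEFAULT_FIELDS.clone();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parseFields(args)));
    }
}
```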

If you are doing text search indexing I strongly recommend looking at Solr. Netflix, for example, uses it for their main text search box.
--
"The Playboy reader invites a female acquaintance in for a quiet discussion of Picasso, Nietzsche, jazz, sex."  - Hugh Hefner

David

Dec 8, 2008, 9:58:21 AM
to Semantic Vectors
Thank you all for the replies. I am still a bit confused as to how to
integrate the two. What do you mean by "to work with a Solr-controlled
index you want to give the field name(s) on the command line"?

Here is the process I've tried unsuccessfully so far. I am using the
demo data from solr so we can all be on the same page. The fields I
have defined are:

<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true" />
  <field name="sku" type="textTight" indexed="true" stored="true" omitNorms="true"/>
  <field name="name" type="text" indexed="true" stored="true"/>
  <field name="nameSort" type="string" indexed="true" stored="false"/>
  <field name="alphaNameSort" type="alphaOnlySort" indexed="true" stored="false"/>
  <field name="manu" type="text" indexed="true" stored="true" omitNorms="true"/>
  <field name="cat" type="text_ws" indexed="true" stored="true" multiValued="true" omitNorms="true" termVectors="true" />
  <field name="features" type="text" indexed="true" stored="true" multiValued="true"/>
  <field name="includes" type="text" indexed="true" stored="true"/>
  <field name="weight" type="sfloat" indexed="true" stored="true"/>
  <field name="price" type="sfloat" indexed="true" stored="true"/>
  <field name="popularity" type="sint" indexed="true" stored="true" default="0"/>
  <field name="inStock" type="boolean" indexed="true" stored="true"/>
  <field name="word" type="string" indexed="true" stored="true"/>
  <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
  <field name="manu_exact" type="string" indexed="true" stored="false"/>
  <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
  <field name="spell" type="textSpell" indexed="true" stored="true" multiValued="true"/>
  <dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
  <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
  <dynamicField name="*_l" type="slong" indexed="true" stored="true"/>
  <dynamicField name="*_t" type="text" indexed="true" stored="true"/>
  <dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
  <dynamicField name="*_f" type="sfloat" indexed="true" stored="true"/>
  <dynamicField name="*_d" type="sdouble" indexed="true" stored="true"/>
  <dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
  <dynamicField name="random*" type="random" />
</fields>

I then change line 198 in BuildIndex.java to read:

String[] fieldsToIndex =
{"id","sku","name","nameSort","alphaNameSort","manu","cat","features","includes","weight","price","popularity","inStock","word","text","manu_exact","timestamp","spell"};

To build the solr index I run:

java -jar post.jar *.xml

This uploads all the exampledocs xml files that are included with
solr. I've run a few searches and the index is fine. I then try
to build SV's index by running:

java pitt.search.semanticvectors.BuildIndex ~/solr/solr/data/index/

I get the following error:

seedLength = 20
Vector length = 200
Minimum frequency = 10
Populating basic sparse doc vector store, number of vectors: 26
Creating store of sparse vectors ...
Created 26 sparse random vectors.
Creating term vectors ...
There are 1133 terms (and 26 docs)
0 ... 1000 ...
Created 5 term vectors ...
Initializing document vector store ...
Building document vectors ...
0 ...
Normalizing doc vectors ...
Exception in thread "main" java.lang.NullPointerException
at pitt.search.semanticvectors.DocVectors.makeWriteableVectorStore
(DocVectors.java:143)
at pitt.search.semanticvectors.BuildIndex.main(BuildIndex.java:231)


What would I explicitly need to do to get SV to build an index of
solr's default xml set?

Thank you all again.

David

Dec 8, 2008, 10:04:08 AM
to Semantic Vectors
In addition, commenting out the term check in
TermVectorsFromLucene.java:

/* skip terms that don't pass the filter */
// if (!termFilter(terms.term())) {
// continue;
// }

changes the error to:

seedLength = 20
Vector length = 200
Minimum frequency = 10
Populating basic sparse doc vector store, number of vectors: 26
Creating store of sparse vectors ...
Created 26 sparse random vectors.
Creating term vectors ...
There are 1133 terms (and 26 docs)
0 ... 1000 ...
Created 560 term vectors ...
Initializing document vector store ...
Building document vectors ...
0 ...
Normalizing doc vectors ...
Exception in thread "main" java.lang.NullPointerException
at pitt.search.semanticvectors.DocVectors.makeWriteableVectorStore
(DocVectors.java:143)
at pitt.search.semanticvectors.BuildIndex.main(BuildIndex.java:231)



David

Dec 8, 2008, 3:09:42 PM
to Semantic Vectors
I have found the answer. DocVectors.java line 136 refers to the Lucene
field "path", which did not exist in my solr index. What does the path
get used for?


Lance Norskog

Dec 8, 2008, 6:18:16 PM
to semanti...@googlegroups.com
SV also has an option to fetch the source text and highlight the found terms, and all documents with the terms that pulled them.  The 'path' field is the unique id for each document, since each document is in a separate directory. This is a 'second feature' of processing and is not necessary for Solr use - Solr has a highlighting feature already.

You could change 'path' to 'id' since this is the unique id in the Solr sample.
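
A hedged sketch of a lookup that tolerates a missing identifier field. A document's stored fields are modelled here as a plain Map rather than a Lucene Document, and the fallback name built from the document number is an assumption made for illustration, not SemanticVectors behaviour.

```java
import java.util.HashMap;
import java.util.Map;

final class DocNameSketch {
    /**
     * Returns the value of the first candidate field that is present,
     * or a name derived from the document number if none is.
     */
    static String resolveDocName(Map<String, String> storedFields,
                                 int docNumber,
                                 String... candidateIdFields) {
        for (String field : candidateIdFields) {
            String value = storedFields.get(field);
            if (value != null) {
                return value;
            }
        }
        // Assumed fallback so that a missing field never yields a null name.
        return "doc" + docNumber;
    }

    public static void main(String[] args) {
        Map<String, String> solrDoc = new HashMap<>();
        solrDoc.put("id", "example-id");  // made-up value
        System.out.println(resolveDocName(solrDoc, 0, "path", "filename", "id"));
    }
}
```

Note that a field can only be read back from a Lucene document if it was stored; in the schema posted above, "id" is declared with stored="true".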

Dominic

Dec 8, 2008, 6:34:24 PM
to Semantic Vectors
Making the makeWriteableVectorStore method look for an "id" field
would work in this case.

Basically DocVectors is at this point looking for the field to use as
a string identifier in the vector store written to disk. The
DocVectors.java class is unfortunately not very well encapsulated -
the two values hardcoded in the makeWriteableVectorStore method
correspond to the field "path" which is generated by the Lucene demo
and the field "filename" which is generated by the Semantic Vectors
BuildBilingualIndex. What should be happening is that the DocVectors
class should have member variables for both the id field and the
content fields to index, and these should be configured upon
instantiation.
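
The restructuring described above might look roughly like this. Everything here is hypothetical: the class name, constructor and accessors sketch the idea of configuring the id field and content fields at construction time, and are not the actual DocVectors API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class DocVectorsConfigSketch {
    private final String idField;              // field used as the document identifier
    private final List<String> contentFields;  // fields whose terms are indexed

    DocVectorsConfigSketch(String idField, String... contentFields) {
        if (idField == null || idField.isEmpty()) {
            throw new IllegalArgumentException("idField must be non-empty");
        }
        if (contentFields.length == 0) {
            throw new IllegalArgumentException("at least one content field is required");
        }
        this.idField = idField;
        this.contentFields =
            Collections.unmodifiableList(new ArrayList<>(Arrays.asList(contentFields)));
    }

    String idField() {
        return idField;
    }

    List<String> contentFields() {
        return contentFields;
    }

    /** True if terms from the given field should be indexed. */
    boolean indexes(String fieldName) {
        return contentFields.contains(fieldName);
    }

    public static void main(String[] args) {
        // Values matching the Solr sample discussed in this thread.
        DocVectorsConfigSketch config = new DocVectorsConfigSketch("id", "includes");
        System.out.println(config.idField() + " " + config.contentFields());
    }
}
```

With both values held in one place, the Lucene demo ("path"), BuildBilingualIndex ("filename") and a Solr index ("id") would differ only in how the object is constructed.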

If you want to file this as an issue, I'll try to get to it in the next
release. I think this is a structural bug that should be fixed.

Thanks for persevering.
Best wishes,
Dominic


David

Dec 9, 2008, 10:29:42 AM
to Semantic Vectors
Thank you both for your replies. I have one more concern and then I
think I will have integrated solr and SV well enough for my
purposes.

I have modified line 198 of BuildIndex.java to read:
String[] fieldsToIndex = {"includes"};

According to Luke, there are only 10 records for this field in the
default data set. (cabl, usb, batteri, card, earbud,
headphon,sd,mb,av,32)

I then build SV's index with no term filters active.

I expect that fieldsToIndex pointing only to the "includes" field
would mean that only this field was searched over. This does not seem
to be the case. When I run a search for "apple" I get the following
result:

Opening query vector store from file: termvectors.bin
Dimensions = 200
Lowercasing term: apple
Searching term vectors, searchtype SUM ... Search output follows ...
1.0000001:applegbipodwithvideoplaybackblack
1.0000001:calendar
1.0000001:320
1.0000001:firmwar
1.0000001:264
1.0000001:phone
1.0000001:control
1.0000001:240
1.0000001:lla
1.0000001:tune
1.0000001:apple
1.0000001:game
1.0000001:jpeg
1.0000001:wallet
1.0000001:life
1.0000001:recharg
1.0000001:2008-12-08T14:38:15.522
1.0000001:150
1.0000001:aac
1.0000001:play

Is there a way to index only the fields I am interested in rather than
all available fields?

Thank you all again for the help.

David

Dec 9, 2008, 11:59:46 AM
to Semantic Vectors
If anyone else has a desire to get solr working with SV, here is what
is required to get the example data going so you can get a better idea
of how it all works. This is not complete. To get sliding window or
bilingual searches going more editing is required.

To get SV to work with solr for the default solr dataset (found in
solr/example/exampledocs), several SV files must be edited:
BuildIndex.java, DocVectors.java and TermVectorsFromLucene.java

Lines 136-138 in DocVectors.java should read:

    if (this.indexReader.document(i).getField("id") != null) {
      docName = this.indexReader.document(i).getField("id").stringValue();
    } else {

Line 198 in BuildIndex.java should be modified to include the fields
to be indexed, e.g.:

    String[] fieldsToIndex = {"includes"};

Modify TermVectorsFromLucene.java lines 212-228 to not filter the
terms by character or frequency, i.e.:

    // Character filter.
    // String termText = term.text();
    // for (int i = 0; i < termText.length(); ++i) {
    //   if (!Character.isLetter(termText.charAt(i))) {
    //     return false;
    //   }
    // }
    //
    // // Frequency filter.
    // int freq = 0;
    // TermDocs tDocs = indexReader.termDocs(term);
    // while (tDocs.next()) {
    //   freq += tDocs.freq();
    // }
    // if (freq < minFreq) {
    //   return false;
    // }

There are example jars provided in solr/example. Start up solr (from
solr/example: java -jar start.jar). Verify that solr is up
(http://localhost:8983/solr/admin/).

Index some files (from solr/example/exampledocs: java -jar post.jar
*.xml). You can verify that this worked on the statistics page of the
admin, by reading the console, or by using Luke
(http://www.getopt.org/luke/).

Use SV to build an index from Solr's index:

java -cp lucene-core.jar:semanticvectors.jar pitt.search.semanticvectors.BuildIndex solr/solr/data/index

Then run a search:

java -cp lucene-core.jar:semanticvectors.jar pitt.search.semanticvectors.Search usb

Should return:

Opening query vector store from file: termvectors.bin
Dimensions = 200
Lowercasing term: usb
Searching term vectors, searchtype SUM ... Search output follows ...
1.0:usb
0.9501752:cabl
0.6149186:32
0.6149186:av
0.6149186:batteri
0.6149186:card
0.6149186:sd
0.6149186:mb
0.58696777:earbud
0.58696777:headphon
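
Instead of commenting the two filters out, they could be made switchable. The following is a sketch under assumptions: per-document frequencies are passed in as a plain int array instead of being read through Lucene's TermDocs, and the class name and constructor are made up for illustration. With the logged "Minimum frequency = 10", a term has to occur at least ten times across the whole index to be kept, which is a high bar for a 26-document sample.

```java
final class SwitchableTermFilterSketch {
    private final boolean lettersOnly;  // reject terms containing non-letters
    private final int minFrequency;     // minimum total occurrences; 0 disables the check

    SwitchableTermFilterSketch(boolean lettersOnly, int minFrequency) {
        this.lettersOnly = lettersOnly;
        this.minFrequency = minFrequency;
    }

    /** Applies the character filter (if enabled) and then the frequency filter. */
    boolean accept(String termText, int[] perDocumentFrequencies) {
        if (lettersOnly) {
            for (int i = 0; i < termText.length(); ++i) {
                if (!Character.isLetter(termText.charAt(i))) {
                    return false;
                }
            }
        }
        int freq = 0;
        for (int f : perDocumentFrequencies) {
            freq += f;
        }
        return freq >= minFrequency;
    }

    public static void main(String[] args) {
        SwitchableTermFilterSketch strict = new SwitchableTermFilterSketch(true, 10);
        SwitchableTermFilterSketch open = new SwitchableTermFilterSketch(false, 0);
        int[] rare = {1, 1, 1};  // made-up frequencies: three occurrences in total
        System.out.println("strict accepts usb: " + strict.accept("usb", rare));
        System.out.println("open accepts usb: " + open.accept("usb", rare));
    }
}
```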


Thanks everyone for all your help! It took me a bit but I think I
have a basic idea of what is going on now.

Lance Norskog

Dec 9, 2008, 8:04:11 PM
to semanti...@googlegroups.com
A couple of notes:

The default Solr text processor indexes all of the text with stemming on - thus 'cabl', 'headphon', etc. Unrelated words which stem the same cause bogus search and SV results.  The example index includes special assistance for breaking up product letters & numbers nicely (it is for an electronics store).

To make searching more convenient, and to allow more useful text searches, we make two indexed fields: mostly raw and stemmed (and a few other things). The raw index has 'cable' and 'headphone'.

The Solr <copyField> command copies N data input fields into a generated field; you don't need to give any data for this field. I think the example index uses this to create a default search field (or it should). So we have two indexed fields each filled with a set of <copyField> commands.
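
The effect of <copyField> can be imitated in plain Java to make the idea concrete. This is only a model of the behaviour described above, not Solr code: documents are maps from field name to values, each rule is a hypothetical {source, destination} pair, and the sample values are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CopyFieldSketch {
    /**
     * Returns a copy of the input document in which every destination field
     * has received the values of its source fields. The input is not modified.
     */
    static Map<String, List<String>> applyCopyRules(Map<String, List<String>> inputDoc,
                                                    String[][] rules) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : inputDoc.entrySet()) {
            out.put(entry.getKey(), new ArrayList<>(entry.getValue()));
        }
        for (String[] rule : rules) {
            List<String> sourceValues = inputDoc.get(rule[0]);
            if (sourceValues == null) {
                continue;  // this document has no value for the source field
            }
            out.computeIfAbsent(rule[1], key -> new ArrayList<>()).addAll(sourceValues);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("name", Arrays.asList("USB cable"));
        doc.put("features", Arrays.asList("1m", "black"));
        String[][] rules = {{"name", "text"}, {"features", "text"}};
        System.out.println(applyCopyRules(doc, rules));
    }
}
```

The generated field ("text" here) never appears in the submitted data, and it is then analysed according to its own field type. That is how one raw input can feed both a stemmed and an unstemmed indexed field.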

This all goes under 'Solr practical advice', a wiki page that is unwritten.

Dominic

Dec 9, 2008, 8:56:00 PM
to Semantic Vectors
You took the words right out of my mouth, Lance.

A wiki page on this topic would be great!

Please let me know if I can give any help, permissions, etc. if
someone decides to start this.

-Dominic
