Behemoth to SolR tutorial?

Alex McLintock

Oct 6, 2013, 1:33:34 PM
to digita...@googlegroups.com
Has anyone written a Behemoth to SolR tutorial? I have read the various discussions about it that have taken place here, and I have read SOLRWriter.java.

Basically, I have exported my documents from Behemoth to SolR, but despite adding fields to both Behemoth and SolR I can only search them through the main text, not through individual facets.

I would love to know that someone else has gotten this working, and how it worked for them. 

If no one has done a tutorial or example then I'll write something more about what I have done and people can see if there is anything I have missed out. 

Alex

DigitalPebble

Oct 7, 2013, 4:11:16 AM
to digita...@googlegroups.com
Hi Alex

There is no tutorial for SOLR, just the wiki page at https://github.com/DigitalPebble/behemoth/wiki/Solr-module which could indeed be updated.

Re your issue: can you see the fields added by Behemoth in the documents? If so, check your SOLR config and make sure the fields are indexed, analysed, etc. It definitely sounds more like an issue with your SOLR config than with Behemoth.

It's always good to have more documentation. If you write a blog entry about your work with Behemoth, I will have a look and check for things you might have missed.

Thanks

Julien




--
You received this message because you are subscribed to the Google Groups "DigitalPebble" group.
To unsubscribe from this group and stop receiving emails from it, send an email to digitalpebbl...@googlegroups.com.
To post to this group, send an email to digita...@googlegroups.com.
Visit this group at http://groups.google.com/group/digitalpebble.
For more options, visit https://groups.google.com/groups/opt_out.



--
 
Open Source Solutions for Text Engineering
 
http://digitalpebble.blogspot.com
http://www.digitalpebble.com

Alex McLintock

Oct 7, 2013, 11:20:49 AM
to digita...@googlegroups.com
Hi Julien, 

It is a bit difficult to know whether I have got the SolR config right without understanding what "right" is. 
For instance, I can't understand this code from SOLRWriter.java:


        // TODO document this param on the wiki
        // process solr.annotations.list
        String list = job.get("solr.annotations.list");
        if (list == null || list.trim().length() == 0) {
            return;
        }
        String[] names = list.split("\\s+");
        for (String name : names) {
            // support all annotations denoted by '*'
            if (name.equals("*")) {
                includeAllAnnotations = true;
            } else {
                String solrFieldName = "annotation_" + name;
                populateMapping(solrFieldName, name);
            }
        }
    }


I set that parameter to "*", not knowing what I should set it to. It now seems that SolR is choking when trying to import documents sent to it by Behemoth, with this error:



ERROR - 2013-10-07 16:13:33.308; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: ERROR: [doc=file:///home/alex/projects/documentFileNameRedacted] unknown field 'annotation_meta.content'
        at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:174)
        at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:73)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:210)


Now I don't know whether I am supposed to have a field annotation_meta or not. What is the relevance of the ".content" on the end?

I have added other fields to SolR, e.g.


   <field name="persontitle" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="jobtitle" type="text_general" indexed="true" stored="true"/>
   <field name="skill" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="organisation" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="company" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="address" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="person" type="text_general" indexed="true" stored="true"/>
   <field name="qualification" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="university" type="text_general" indexed="true" stored="true" multiValued="true"/>

Should I also add

   <field name="annotation_meta" type="text_general" indexed="true" stored="true" multiValued="true"/>

And others I don't know about?


My issue is that unless there is a simple example I don't really know if a problem is a Behemoth or a SolR problem. 

Alex

DigitalPebble

Oct 10, 2013, 4:56:41 AM
to digita...@googlegroups.com
Hi Alex

Comments below. BTW, the Hadoop logs give you useful information about the field mapping in the SOLR plugin.

On 7 October 2013 16:20, Alex McLintock <alex.mc...@gmail.com> wrote:
Hi Julien, 

It is a bit difficult to know whether I have got the SolR config right without understanding what "right" is. 
 
For instance, I can't understand this code from SOLRWriter.java:


        // TODO document this param on the wiki
        // process solr.annotations.list
        String list = job.get("solr.annotations.list");
        if (list == null || list.trim().length() == 0) {
            return;
        }
        String[] names = list.split("\\s+");
        for (String name : names) {
            // support all annotations denoted by '*'
            if (name.equals("*")) {
                includeAllAnnotations = true;
            } else {
                String solrFieldName = "annotation_" + name;
                populateMapping(solrFieldName, name);
            }
        }
    }


I set that parameter to "*", not knowing what I should set it to. It now seems that SolR is choking when trying to import documents sent to it by Behemoth, with this error:

            if (name.equals("*")) {
                includeAllAnnotations = true;
            }

Setting * as the value will try to add all the annotations to SOLR. That is OK if you want to use dynamic fields in SOLR but makes no sense otherwise, especially if you have Token annotations.

solr.annotations.list prefixes the solr fields generated with 'annotation_' which makes it easier to use dynamic fields
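
As a sketch of what a non-wildcard value could look like: going by the SOLRWriter code quoted above, the value is a whitespace-separated list of annotation type names, and each name is mapped to a Solr field called annotation_<name>. The Person and University types are just the examples used elsewhere in this thread, and putting the property in behemoth-site.xml is an assumption rather than something confirmed here.

```xml
<!-- Assumed location: behemoth-site.xml (not confirmed in this thread) -->
<property>
  <name>solr.annotations.list</name>
  <!-- Whitespace-separated annotation types; per the quoted code this
       yields the Solr fields annotation_Person and annotation_University -->
  <value>Person University</value>
</property>
```

Both generated field names would still have to exist in the Solr schema, either declared explicitly or covered by a dynamic field.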

You don't have to use solr.annotations.list - see below
 



ERROR - 2013-10-07 16:13:33.308; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: ERROR: [doc=file:///home/alex/projects/documentFileNameRedacted] unknown field 'annotation_meta.content'
        at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:174)
        at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:73)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:210) 

You haven't defined that field in your SOLR schema, and that crashes the indexing. With *, a field is generated for each annotation type, so every one of those fields has to exist in the schema.
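
If you do want to keep *, one way to avoid the unknown-field error is a dynamic field that catches everything carrying the annotation_ prefix. This is only a sketch: it assumes the text_general type used in the field definitions quoted in this thread, and it would sit alongside the other field definitions in schema.xml.

```xml
<!-- schema.xml: matches annotation_meta.content and any other
     field whose name starts with annotation_ -->
<dynamicField name="annotation_*" type="text_general"
              indexed="true" stored="true" multiValued="true"/>
```

Bear in mind that this indexes every annotation and feature that Behemoth sends, Token annotations included, so the index can grow quickly.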

 
Now I don't know whether I am supposed to have a field annotation_meta or not. What is the relevance of the ".content" on the end?

.content is a feature of your annotation meta. Since you used *, it uses all the annotation names plus their feature names.

 
 

I have added other fields to SolR, e.g.


   <field name="persontitle" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="jobtitle" type="text_general" indexed="true" stored="true"/>
   <field name="skill" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="organisation" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="company" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="address" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="person" type="text_general" indexed="true" stored="true"/>
   <field name="qualification" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="university" type="text_general" indexed="true" stored="true" multiValued="true"/>

Should I also add

   <field name="annotation_meta" type="text_general" indexed="true" stored="true" multiValued="true"/>

And others I don't know about?

Just declare the ones you need and list them with the solr.f.* parameter

// solr.f.name = BehemothType.featureName
// e.g. solr.f.person = Person.string will map the "string" feature of
// "Person" annotations onto the Solr field "person"

Specifying * for the feature name (e.g. solr.f.Person = Person.*) will use the text covered by the annotation as the value.

To generate a field 'university', assuming you have an annotation University in your Behemoth doc, you would define

<property>
  <name>solr.f.university</name>
  <value>University.*</value>
</property>

to use the text covered by the University annotations as a value

If you have, say, a Person annotation with a feature 'title', you could put its value in the persontitle field with

<property>
  <name>solr.f.persontitle</name>
  <value>Person.title</value>
</property>
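
For completeness, here is the commented example from above (solr.f.person = Person.string) written out in the same property form. Treat it as a sketch: whether your Person annotations actually carry a 'string' feature depends on the annotator, and that feature name is taken from the code comment rather than checked against any particular pipeline.

```xml
<property>
  <name>solr.f.person</name>
  <!-- BehemothType.featureName: the "string" feature of Person annotations -->
  <value>Person.string</value>
</property>
```

In the schema quoted earlier, 'person' is declared without multiValued="true", so a document with more than one Person annotation would presumably need that field made multi-valued.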


 

My issue is that unless there is a simple example I don't really know if a problem is a Behemoth or a SolR problem. 

Neither as such: the problem is with your configs in both Behemoth and Solr.

Hope this helps!

J.

zr

Nov 13, 2014, 5:03:09 PM
to digita...@googlegroups.com, jul...@digitalpebble.com
A tutorial or a few additional notes in the wiki would be really useful here. I'm having the same problem preserving the fields in my Solr XML documents and am not sure I clearly understand the response posted here. Do I need to define fields in my solr schema? What about in schemaless mode? Where does one make changes to the Behemoth config?

Zach

DigitalPebble

Nov 14, 2014, 7:43:49 AM
to zr, digita...@googlegroups.com
Hi Zach

A tutorial or a few additional notes in the wiki would be really useful here. I'm having the same problem preserving the fields in my Solr XML documents and am not sure I clearly understand the response posted here. Do I need to define fields in my solr schema?

Yes, you do.
 
What about in schemaless mode?

I haven't tried it yet but I can't think of a reason why it shouldn't work
 
Where does one make changes to the Behemoth config?




--

zach.r...@gmail.com

Nov 18, 2014, 2:44:40 PM
to digita...@googlegroups.com, zach.r...@gmail.com, jul...@digitalpebble.com
Hi Julien

Thanks for the reply. I may be misreading the thread. Let me re-state the problem. As you can tell, I am new to Behemoth and Hadoop. Sorry in advance for the simple questions!

I have several million documents in Solr XML format like this:

<add>
  <doc>
    <field name="rating">44</field>
    <field name="title">Some title</field>
    <field name="text">Some text ...</field>
    <field name="cat">News category</field>
    <field name="timestamp">0594-09-02T00:00:00Z</field>
    <field name="id">1412728087373312</field>
  </doc>
</add>

The goal is to index these all in Solr using Behemoth. If I understand correctly, I first generate a Behemoth corpus from these docs and then run a SOLRIndexerJob on that corpus. Is that correct? Do I need to parse the contents with TIKA also? 

What has to go into behemoth-site.xml file to make sure the solr fields are indexed properly?

Thanks!
Zach

DigitalPebble

Nov 19, 2014, 6:15:07 AM
to digita...@googlegroups.com
Hi Zach

Why don't you use DIH (https://wiki.apache.org/solr/DataImportHandler) for this?

Unless you want to enrich the data in some way, Behemoth is probably overkill for simply loading these files into SOLR.

You are right that after generating a Behemoth corpus you'd have to run Tika to convert the XML markup into annotations. However, IIRC, in the current version of the SOLRWriter you would not be able to use the value of an attribute (e.g. rating) as a field name and the underlying text as a value.

That class could be modified to that effect if necessary but it is probably not the easiest way to load these docs into SOLR to be honest.

Julien


Zach Rowinski

Nov 19, 2014, 1:28:49 PM
to digita...@googlegroups.com
Hi Julien, 

Thank you for clarifying. We are mainly looking for a way to apply map-reduce to indexing an arbitrarily large number of documents. Solr doesn't currently support map-reduce through DIH or any other means, though that may change with SOLR-1045. I'll continue to look around or see about implementing something internally. Thanks again for taking a moment to reply!

Zach
