SOLR indexing for GATE behemoth document

madhum...@gmail.com

Mar 31, 2013, 6:25:23 AM
to digita...@googlegroups.com
I am new to Behemoth, SOLR and Hadoop. I am using the Behemoth SOLR job to index a Behemoth document that was previously processed with the Behemoth GATE job. I want to index all the annotations and features generated by my GATE application as fields in the SOLR schema. But when I run the job, only the text is indexed and the annotations are skipped. How can I index all the annotations?

DigitalPebble

Apr 1, 2013, 6:28:24 AM
to digita...@googlegroups.com
Hi 


You need to define solr.f.* params in the behemoth-site.xml file, with the SOLR field name in the param name (e.g. solr.f.person) and a value that points to the annotation type and, optionally, a feature name.
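
As a rough illustration of the mapping described above, the entries might look like the sketch below. The annotation type Person and its feature firstName are assumed here purely for the example, as are the SOLR field names; the target fields would presumably also have to exist in the SOLR schema.

```xml
<!-- behemoth-site.xml fragment: a sketch, not a verified configuration.
     "Person" and "firstName" are hypothetical GATE annotation/feature names. -->

<!-- SOLR field "person", filled from annotations of type Person -->
<property>
  <name>solr.f.person</name>
  <value>Person</value>
</property>

<!-- SOLR field "person_firstname", filled from the firstName feature -->
<property>
  <name>solr.f.person_firstname</name>
  <value>Person.firstName</value>
</property>
```

The part of the property name after solr.f. is the SOLR field; the value follows the AnnotationType or AnnotationType.featureName pattern.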

HTH

Julien



--
You received this message because you are subscribed to the Google Groups "DigitalPebble" group.
To unsubscribe from this group and stop receiving emails from it, send an email to digitalpebbl...@googlegroups.com.
To post to this group, send an email to digita...@googlegroups.com.
Visit this group at http://groups.google.com/group/digitalpebble?hl=en-GB.
For more options, visit https://groups.google.com/groups/opt_out.
 
 



--
 
Open Source Solutions for Text Engineering
 
http://digitalpebble.blogspot.com
http://www.digitalpebble.com

madhum...@gmail.com

Apr 14, 2013, 10:37:30 AM
to digita...@googlegroups.com, jul...@digitalpebble.com
Thank you so much. This solves my issue. 

Madhumita

jimmyn...@gmail.com

Sep 17, 2013, 9:47:31 PM
to digita...@googlegroups.com, jul...@digitalpebble.com
Hello Julien,

Does that mean we should add lines like the following to behemoth-site.xml?

<property>
      <name>solr.f.cat</name>
      <value>Token.string</value>
</property>

(I know it is not recommended to index the tokens, but I'm just testing and my test document is really short.)

It doesn't work on my side. I get my documents into Solr, but I can't seem to index the annotations.

Also, I don't really understand where the document's id and version come from. Is it even possible to see the content of the document that Behemoth passes to Solr?

Thanks for your help!
Jim

Julien Nioche

Sep 18, 2013, 4:14:09 AM
to digita...@googlegroups.com, jimmyn...@gmail.com
Hi Jimmy

On Wednesday, 18 September 2013 02:47:31 UTC+1, jimmyn...@gmail.com wrote:
> Hello Julien,
>
> does that mean that we should add (to behemoth-site.xml) lines like the following ?
>
> <property>
>       <name>solr.f.cat</name>
>       <value>Token.string</value>
> </property>
>
> (I know it is not recommended to index the tokens but I'm just testing and my test document is really short)

Yes.

 

> It doesn't work on my side... I get my documents into solr, but I can't seem to be able to index the annotations.

You also need to specify solr.annotations=true
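
Put together with the field mapping quoted above, the configuration would look roughly like this. Both property names come from this thread; placing solr.annotations in behemoth-site.xml alongside the mapping is an assumption.

```xml
<!-- behemoth-site.xml fragment (sketch): switch annotation indexing on,
     then map one SOLR field to an annotation feature. -->
<property>
  <name>solr.annotations</name>
  <value>true</value>
</property>
<property>
  <name>solr.f.cat</name>
  <value>Token.string</value>
</property>
```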
 

> Also, I don't really understand where the document's id and version come from... Is it even possible to see the content of the document that Behemoth passes to Solr ?

The id is the value of the getUrl() on the Behemoth doc. The version field is probably added by SOLR itself. See https://github.com/DigitalPebble/behemoth/blob/master/solr/src/main/java/com/digitalpebble/behemoth/solr/SOLRWriter.java

You can view the content of a Behemoth doc using the CorpusReader command (see https://github.com/DigitalPebble/behemoth/wiki/tutorial)

HTH

Julien 

jimmyn...@gmail.com

Sep 18, 2013, 6:40:48 PM
to digita...@googlegroups.com, jimmyn...@gmail.com
Hi Julien,

This is exactly what I needed. Thank you so much for your quick reply!

Jim

jimmyn...@gmail.com

Sep 18, 2013, 11:46:46 PM
to digita...@googlegroups.com, jimmyn...@gmail.com
Hello again,

The annotations produced by my GATE application usually have several features (for example, the annotation type Person has the following features: Person.title, Person.firstName, Person.lastName, Person.gender).
Each of my documents may contain more than one Person annotation, which is why I would like to index all the features for one annotation in one Solr field.
How do I do that?

I thought I'd add the following lines to schema.xml:

<types>
...
<fieldType name="person" class="..." subSuffix="_person" />
...
</types>
...
<fields>
...
<field name="personinfo" type="person" indexed="true" stored="true" multiValued="true" />
<dynamicField name="*_person" type="text_general" indexed="true" stored="false" />
...
</fields>

and then I'd have the corresponding subtypes in behemoth-site.xml:

<property>
<name>solr.f.title_person</name>
<value>Person.title</value>
</property>

Is this a good start?

Also, I wouldn't know what class to use for my "person" fieldType. Would I have to write a new Java class myself, or is there an easier way to do it?

Thanks!
J.

DigitalPebble

Sep 19, 2013, 3:09:32 PM
to digita...@googlegroups.com
Hi Jimmy

That's definitely more of a SOLR question. I see that you posted it on the SOLR mailing list (http://www.mail-archive.com/solr...@lucene.apache.org/msg89067.html); let's see what the people there can suggest. I haven't used subfields at all, so I can't really help you here.

Alternatively, you could store the additional info as a payload, but I am not sure how well supported that is in SOLR, and it would certainly require some coding in both Behemoth and SOLR.

Julien



jimmyn...@gmail.com

Sep 19, 2013, 6:31:11 PM
to digita...@googlegroups.com, jul...@digitalpebble.com
Hi Julien,

Sorry for posting here, I realize this is definitely more of a solr-related question.
I'm still looking into defining my own fieldtypes.

Cheers,
J.

jimmyn...@gmail.com

Oct 3, 2013, 6:35:24 PM
to digita...@googlegroups.com, jul...@digitalpebble.com, jimmyn...@gmail.com
Hello again,

I still haven't found out how to resolve my problem. As a workaround, I was wondering whether it is possible to add attributes to our Solr fields?

As I mentioned earlier, I'm indexing GATE-annotated documents into Solr. The annotations produced by my GATE application usually have several features (for example, Person.title, Person.name, Person.phoneNumber...).
Now each of my documents may contain more than one Person annotation, and each person might have more than one phone number. Unfortunately, I don't know how to index all the features for one annotation in one Solr field.

So instead, I would like to add an "id" (or "offset") attribute to each of the features I'm sending to Solr, in order to find out, for example, which Person.name goes with which Person.phoneNumber.

So instead of:
<doc>
  <str name="id">1</str>
  <arr name="person"> <str>Jane Doe</str> <str>John Doe</str> </arr>
  <arr name="person_phoneNumber"> <str>0123456789</str> <str>1234567890</str> <str>2345678901</str> </arr>
</doc>

I'd like to get something like this in Solr:
<doc>
  <str name="id">1</str>
  <arr name="person"> <str id="1">Jane Doe</str> <str id="2">John Doe</str> </arr>
  <arr name="person_phoneNumber"> <str id="1">0123456789</str> <str id="1">1234567890</str> <str id="2">2345678901</str> </arr>
</doc>

This way it is easy to link the first two phone numbers to Jane Doe and the last one to John Doe.

Any ideas?

Thanks!
Jim

DigitalPebble

Oct 4, 2013, 3:53:29 AM
to digita...@googlegroups.com
Hi Jim, 

You could use payloads in SOLR, but it gets quite complex to use (see old post http://digitalpebble.blogspot.co.uk/2010/08/using-payloads-with-dismaxqparser-in.html). There might be another way of doing it in SOLR, but I can't think of one right now.

An easier approach would be to use ElasticSearch and send a structured document. Nested types could be useful (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-nested-type.html). That would allow you to index a person as a JSON object, e.g. with named fields representing the features, and link that to the original document.

There is a ridiculously primitive Behemoth module for ElasticSearch on https://github.com/DigitalPebble/behemoth-elasticsearch which you could use as a starting point.

HTH

Julien




jimmyn...@gmail.com

Oct 4, 2013, 12:27:01 PM
to digita...@googlegroups.com, jul...@digitalpebble.com
Hello Julien,

and thank you for your answer.

I tried to simplify my problem but I realize I chose a bad example: I don't process phone numbers, and I do process unstructured documents.

My GATE application might return several annotations for the same group of words (because I'm using an ontology). For example, I will have an Animal annotation that marks the words "cat", "catfish" and "eider" as animals. Depending on the ontology used, the "cat" annotation will have two features (Animal.class=mammal and Animal.class=cat), the "catfish" annotation will have one feature (Animal.class=fish), and the more specific term "eider" will have two features (Animal.class=bird and Animal.class=duck).

I don't want one Solr "document" for each animal; I really want one index entry for each actual document. I'd like to be able to query my Solr index for "bird" and get all the documents containing the term "bird" or any subclass or instance of it (like "duck" or "eider"). Since all occurrences of "bird", "duck" and "eider" in my documents will be tagged as Animal, with an annotation carrying Animal.class=bird, it is easy to get Solr to return the right documents.

But since I get something like:

<result>
  <doc>
    <str name="id">hdfs://...</str>
    <arr name="animal">
      <str>cat</str>
      <str>cat</str>
      <str>catfish</str>
      <str>eider</str>
      <str>eider</str>
    </arr>
    <arr name="class">
      <str>mammal</str>
      <str>cat</str>
      <str>fish</str>
      <str>bird</str>
      <str>duck</str>
    </arr>
    <arr name="instance">
      <str>http://.../Animal#catfish</str>
      <str>http://.../Animal#eider</str>
      <str>http://.../Animal#eider</str>
    </arr>
  </doc>
  <doc>
     ...
  </doc>
  <doc>
     ...
  </doc>
</result>

... when I want to generate a snippet of the document and highlight the terms that made Solr return it (for example, the first document contains "eider" and the user is searching for "bird"), I'd like to highlight the term "eider" in the snippet, but I don't know how to do that. Having a correspondence between my Solr "animal" and "class" fields (for example, an id attribute that would link them: <str id="5">eider</str> and the same id for the class "bird") would make it easier to highlight the term "eider".

What do you think?

Thanks!
Jim

DigitalPebble

Oct 4, 2013, 2:37:38 PM
to digita...@googlegroups.com
My suggestion was to use ES, not with one doc per term but with nested docs. Anyway, what you want is to associate multiple tokens with the same offsets in the original text so that the highlights return the text. These tokens would correspond to the various classes and features that exist for a given entity. That's what is used internally by SOLR for synonyms, for instance. The challenge in your case is to find a way of passing this information to SOLR and getting it to generate the right tokens. You would probably need to write a custom analyzer for this.
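
To make the synonym mechanism mentioned above concrete (this is not the custom analyzer itself): the stock synonym filter in SOLR emits extra tokens at the same position as the matched token. The sketch below assumes a hypothetical field type name, text_animal, and a hand-maintained synonyms file, animal_synonyms.txt. It would only help if the ontology classes can be exported to such a file, which is an assumption and not something established in this thread.

```xml
<!-- schema.xml fragment (sketch). "text_animal" and "animal_synonyms.txt"
     are hypothetical names chosen for this example. -->
<fieldType name="text_animal" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Adds the mapped terms at the position of the matched token.
         A line in the synonyms file could read:  eider => eider, duck, bird -->
    <filter class="solr.SynonymFilterFactory" synonyms="animal_synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a mapping like this applied at index time, a query for "bird" would match the token injected at the position of "eider", which is the behaviour the highlighting use case relies on. Passing per-annotation classes from GATE instead of a static file would still need the custom work described above.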

Can you tell us more about your project in general and the context in which you do the search? E.g. could you use Lucene instead? What volume of data do you want to index?

Have you looked at http://gate.ac.uk/mimir/ ?

Again, ask this on the SOLR list; you should get good advice there.

Julien


