ExtractingRequestHandler/Solr Cell Support


dano

Nov 16, 2010, 10:50:56 PM
to rsolr
Are there any plans for rsolr to support the
ExtractingRequestHandler for indexing binary docs?

matt mitchell

Nov 16, 2010, 11:51:03 PM
to rsolr
Hey, I've had this request before and am considering using this
https://github.com/nicksieger/multipart-post to do the dirty work.

I'm thinking something like this:

solr.update("extract",
:data => open("data.html"),
:params => {"literal.id" => 1},
:headers => {"Content-Type"=>"text/html"})

So basically by assigning a File object to :data, RSolr would know
what to do when sending a POST/update request. Seem reasonable?
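
For comparison, here is a rough sketch of the same kind of request built with only Ruby's standard library, no RSolr involved. It posts the file as the raw request body rather than as multipart. The base URL, the `build_extract_request` helper name, and the temp file standing in for data.html are all assumptions for illustration; the request is built but not sent.

```ruby
require "net/http"
require "uri"
require "tempfile"

# Build (but do not send) a raw-body POST aimed at Solr's extract handler.
# base_url and the literal.id value are illustrative assumptions.
def build_extract_request(base_url, io, content_type, params = {})
  uri = URI.join(base_url, "update/extract")
  uri.query = URI.encode_www_form(params)
  request = Net::HTTP::Post.new(uri.request_uri)
  request["Content-Type"] = content_type
  request.body = io.read
  [uri, request]
end

# Stand-in for data.html so the sketch runs anywhere.
file = Tempfile.new(["data", ".html"])
file.write("<html><body>Hello Solr</body></html>")
file.rewind

uri, request = build_extract_request(
  "http://localhost:8983/solr/", file, "text/html", "literal.id" => 1
)
puts uri

# To actually send it (needs a running Solr):
# Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
```

An RSolr version would presumably hide the URI and header handling behind the :data, :params and :headers options shown above.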

Matt

Dan Young

Nov 16, 2010, 11:58:02 PM
to rs...@googlegroups.com
I'll need to look into using Tika (or something else?) to extract the content of my PDF files, and then I could possibly do something like what you outline. This looks very promising and much cleaner than what I'm doing now with typhoeus.

Regards,

Dan




dano

Dec 9, 2010, 1:58:02 AM
to rsolr
Matt,

I'm trying this out but am having some issues when posting
with solr.update. From a Solr perspective, I need to post to the
following URL: http://localhost:8983//solr/<my core name>/update/extract?
But when I try solr.update 'extract' I get an error message:
Uncaught exception: wrong number of arguments (2 for 1). It looks
like the update method can take a hash with :data, :headers,
and :params, but how can I specify the extract path that I need
to post to?

Regards,

Dan



Matt Mitchell

Dec 9, 2010, 8:32:13 AM
to rs...@googlegroups.com
Hi Dan,

Can you post your complete code + error stack?

Matt

Dan Young

Dec 9, 2010, 9:09:26 AM
to rs...@googlegroups.com
Here's the relevant section:

              solr.update 'extract', {:data => File.open("/tmp/#{solr_doc.cloud_id}", "rb") {|io| io.read },
               :params => {"literal.aop_study_id"=> priorart.studyid,
                 "literal.a_study_title" => priorart.study.title,
                 "literal.a_code" => priorart.study.code,
                 "literal.a_industry" => priorart.study.industry,
                 "literal.a_category" => priorart.study.category,
                 "literal.a_owner" => priorart.study.owner,
                 "literal.a_priorart_id" => priorart.id,
                 "literal.a_priorart_title" => priorart.title,
                 "literal.a_cloudfile_id" => solr_doc.cloud_id,
                 "literal.a_priorart_submission_date" => priorart.created_timestamp,
                 "literal.a_doctype" => priorart.doctype,
                 "literal.a_priorart_author" => priorart.author,
                 "literal.a_priorart_isbn" => priorart.isbn,
                 "literal.a_advisor" => priorart.member.username,
                 "fmap.content" => 'text'},
               :headers => {"Content-Type"=>solr_doc.content_type,"Content-Length"=>solr_doc.content_length}}

When I run this through rdebug, I get:
/home/solr/util/solr/site_priorart/i.rb:46:in `block in <top (required)>'
/home/solr/util/solr/site_priorart/i.rb:41:in `each'
/home/solr/util/solr/site_priorart/i.rb:41:in `<top (required)>'
/usr/local/lib/ruby/gems/1.9.1/gems/ruby-debug19-0.11.5/bin/rdebug:125:in `debug_load'
/usr/local/lib/ruby/gems/1.9.1/gems/ruby-debug19-0.11.5/bin/rdebug:125:in `debug_program'
/usr/local/lib/ruby/gems/1.9.1/gems/ruby-debug19-0.11.5/bin/rdebug:412:in `<top (required)>'
/usr/local/bin/rdebug:19:in `load'
/usr/local/bin/rdebug:19:in `<main>'
Uncaught exception: wrong number of arguments (2 for 1)

Maybe I'm doing something completely wrong here; any assistance would be appreciated. I was wondering if I should be using build_request instead?

Regards,

Dan

Dan Young

Dec 9, 2010, 9:36:03 AM
to rs...@googlegroups.com
I also get the same error when I try the method you outlined in an earlier email:

solr.update("extract",
 :data => open("data.html"),
 :params => {"literal.id" => 1},
 :headers => {"Content-Type"=>"text/html"})

Regards,

Dan.


Matt Mitchell

Dec 9, 2010, 10:43:10 AM
to rs...@googlegroups.com
Hi Dan,

Sorry, the example I posted previously (:data => open()) was only an
illustration of what the code could look like if this feature were
implemented; it isn't actually implemented yet. I do have something in
master you could look at and test if you'd like:

https://github.com/mwmitchell/rsolr/blob/master/lib/rsolr/connection.rb#L42

That should do the trick, but it's totally untested at this point.

For the other error, if you look at the update method:

https://github.com/mwmitchell/rsolr/blob/master/lib/rsolr/client.rb#L50

You'll see that update actually has a preset "path" of "update", so it
only accepts the options hash. I think what you want to use is "post":

solr.post "update/extract", {}

I could adapt the update method to accept a path fragment to build on "update"?

So if you want to try master, do something like:

solr.post "update/extract",
  :data => open("data.html"),
  :params => {"literal.id" => 1},
  :headers => {"Content-Type"=>"text/html"}
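
To make the "wrong number of arguments" error concrete, here is a minimal sketch with a stand-in class. `FakeClient` is hypothetical and is not RSolr's source; it only mirrors the shape described above, where update has its path fixed and post takes the path as its first argument.

```ruby
# Hypothetical stand-in, NOT RSolr's real code: update has a preset
# path, post accepts one.
class FakeClient
  def post(path, opts = {})
    { :path => path, :opts => opts }
  end

  # The path is fixed here, so the only argument is the options hash.
  def update(opts = {})
    post("update", opts)
  end
end

solr = FakeClient.new

begin
  # Two arguments (a path plus a hash) where at most one is accepted.
  solr.update("extract", { :data => "<html/>" })
rescue ArgumentError => e
  puts "update failed: #{e.message}"
end

# Passing the full path to post avoids the problem.
result = solr.post("update/extract", :data => "<html/>")
puts result[:path]
```

The exact wording of the ArgumentError message depends on the Ruby version.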

Matt

Dan Young

Dec 9, 2010, 10:48:00 AM
to rs...@googlegroups.com
Matt, thank you....I'll look @ this right now....

Rock on!


Regards,

Dan

Dan Young

Dec 9, 2010, 1:51:36 PM
to rs...@googlegroups.com
Looks like I've got it working...I ended up with this:

               r = solr.post 'update/extract', :data => data,
               :params => {
                 "literal.aop_study_id"=> priorart.studyid,
                 "literal.aop_study_title" => priorart.study.title,
                 "literal.aop_code" => priorart.study.code,
                 "literal.aop_industry" => priorart.study.industry,
                 "literal.aop_category" => priorart.study.category,
                 "literal.aop_owner" => priorart.study.owner,
                 "literal.aop_priorart_id" => priorart.id,
                 "literal.aop_priorart_title" => priorart.title,
                 "literal.aop_cloudfile_id" => solr_doc.cloud_id,
                 "literal.aop_priorart_submission_date" => priorart.created_timestamp,
                 "literal.aop_doctype" => priorart.doctype,
                 "literal.aop_priorart_author" => priorart.author,
                 "literal.aop_priorart_isbn" => priorart.isbn,
                 "literal.aop_advisor" => priorart.member.username,
                 "fmap.content" => 'text',
                 :lowernames => 'true',
                 :uprefix =>'ignored_'},
               :headers => {"Content-Type"=>solr_doc.content_type,"Content-Length"=>solr_doc.content_length}

On a side note, is there a way to dump out the full url that rsolr uses?

Thank you for your help.

Regards,

Dan

Matt Mitchell

Dec 9, 2010, 2:01:12 PM
to rs...@googlegroups.com
Yes, but you'll have to build the request first:

request = solr.build_request('update/extract', :data=>xxx, :params => {})
puts request[:uri].to_s
response = solr.execute(request)

... that *should* work.
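
Once a URL has been printed, its query string can be checked with Ruby's standard library to confirm that the literal.* and fmap.* parameters made it through. This is only a sketch: the URL string and the "mycore" core name below are made up for illustration and are not actual rsolr output.

```ruby
require "uri"

# Hypothetical URL; paste whatever the puts call actually printed.
printed = "http://localhost:8983/solr/mycore/update/extract" \
          "?literal.id=1&fmap.content=text&uprefix=ignored_"

uri = URI.parse(printed)

# decode_www_form returns [key, value] pairs; to_h turns them into a Hash.
params = URI.decode_www_form(uri.query).to_h

puts uri.path
params.each { |key, value| puts "#{key} = #{value}" }
```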

Matt

Dan Young

Dec 9, 2010, 2:16:07 PM
to rs...@googlegroups.com
Ok great.....I'll give that a whirl too...

Dano