ExtractingRequestHandler/Solr Cell Support

dano

Nov 16, 2010, 10:50:56 PM

to rsolr

Are there any plans for rsolr to support the
ExtractingRequestHandler for indexing binary docs?

matt mitchell

Nov 16, 2010, 11:51:03 PM

to rsolr

Hey, I've had this request before and am considering using this
https://github.com/nicksieger/multipart-post to do the dirty work.

I'm thinking something like this:

solr.update("extract",
:data => open("data.html"),
:params => {"literal.id" => 1},
:headers => {"Content-Type"=>"text/html"})

So basically by assigning a File object to :data, RSolr would know
what to do when sending a POST/update request. Seem reasonable?
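
For illustration only, here is a rough sketch of what such a request could look like at the HTTP level, using Ruby's standard library rather than RSolr or multipart-post. It sends the document as the raw request body instead of a multipart form. The helper name build_extract_request and the base URL are assumptions made up for this sketch; the request is built but never sent.

```ruby
require "net/http"
require "uri"
require "stringio"

# Hypothetical helper (not part of RSolr): build, but do not send, a POST
# to the extract handler. The base URL and handler path are assumptions
# about the local Solr setup.
def build_extract_request(base_url, data, params, content_type)
  base = base_url.end_with?("/") ? base_url : "#{base_url}/"
  uri = URI.join(base, "update/extract")
  uri.query = URI.encode_www_form(params)

  request = Net::HTTP::Post.new(uri)
  # Accept a String or anything that responds to #read (File, StringIO).
  request.body = data.respond_to?(:read) ? data.read : data
  request["Content-Type"] = content_type
  request
end

# StringIO stands in for open("data.html") so the sketch runs without a file.
io = StringIO.new("<html><body>hello</body></html>")
request = build_extract_request(
  "http://localhost:8983/solr", io, { "literal.id" => 1 }, "text/html"
)
puts request.path
```

Checking for #read is one way a client could tell a File-like object from a plain string; whether RSolr ends up doing it this way is not something this thread settles.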

Matt

Dan Young

Nov 16, 2010, 11:58:02 PM

to rs...@googlegroups.com

I'll need to look into using tika (or something else?) to extract the content of my pdf files, and then I could possibly do something like you outline. This looks very promising and much cleaner than what I'm doing now with typhoeus.

Regards,

Dan

dano

Dec 9, 2010, 1:58:02 AM

to rsolr

Matt,

I'm trying this out but am having some issues posting with
solr.update. From a Solr perspective, I need to post to the
following url: http://localhost:8983/solr/<my core name>/update/extract
but when I try solr.update 'extract' I get an error message:
Uncaught exception: wrong number of arguments (2 for 1). It looks
like the update method can take a hash with :data, :headers,
and :params, but how can I specify the update/extract path that I
need to post to?

Regards,

Dan

Matt Mitchell

Dec 9, 2010, 8:32:13 AM

to rs...@googlegroups.com

Hi Dan,

Can you post your complete code + error stack?

Matt

Dan Young

Dec 9, 2010, 9:09:26 AM

to rs...@googlegroups.com

Here's the relevant section:

solr.update 'extract', {:data => File.open("/tmp/#{solr_doc.cloud_id}", "rb") {|io| io.read },
  :params => {"literal.aop_study_id" => priorart.studyid,
    "literal.a_study_title" => priorart.study.title,
    "literal.a_code" => priorart.study.code,
    "literal.a_industry" => priorart.study.industry,
    "literal.a_category" => priorart.study.category,
    "literal.a_owner" => priorart.study.owner,
    "literal.a_priorart_id" => priorart.id,
    "literal.a_priorart_title" => priorart.title,
    "literal.a_cloudfile_id" => solr_doc.cloud_id,
    "literal.a_priorart_submission_date" => priorart.created_timestamp,
    "literal.a_doctype" => priorart.doctype,
    "literal.a_priorart_author" => priorart.author,
    "literal.a_priorart_isbn" => priorart.isbn,
    "literal.a_advisor" => priorart.member.username,
    "fmap.content" => 'text'},
  :headers => {"Content-Type" => solr_doc.content_type, "Content-Length" => solr_doc.content_length}}

When I run this through rdebug:

/home/solr/util/solr/site_priorart/i.rb:46:in `block in <top (required)>'
/home/solr/util/solr/site_priorart/i.rb:41:in `each'
/home/solr/util/solr/site_priorart/i.rb:41:in `<top (required)>'
/usr/local/lib/ruby/gems/1.9.1/gems/ruby-debug19-0.11.5/bin/rdebug:125:in `debug_load'
/usr/local/lib/ruby/gems/1.9.1/gems/ruby-debug19-0.11.5/bin/rdebug:125:in `debug_program'
/usr/local/lib/ruby/gems/1.9.1/gems/ruby-debug19-0.11.5/bin/rdebug:412:in `<top (required)>'
/usr/local/bin/rdebug:19:in `load'
/usr/local/bin/rdebug:19:in `<main>'
Uncaught exception: wrong number of arguments (2 for 1)

Maybe I'm doing something completely wrong here; any assistance would be appreciated. I was wondering if I should be using build_request instead?

Regards,

Dan

Dan Young

Dec 9, 2010, 9:36:03 AM

to rs...@googlegroups.com

I also get the same error when I try the method you outlined in an earlier email..

solr.update("extract",
  :data => open("data.html"),
  :params => {"literal.id" => 1},
  :headers => {"Content-Type" => "text/html"})

Regards,

Dan.

Matt Mitchell

Dec 9, 2010, 10:43:10 AM

to rs...@googlegroups.com

Hi Dan,

Sorry, the example I posted previously (:data => open()) was only an
example of what the code could look like if this feature were
implemented; it isn't actually implemented yet. I do have something
in master you could look at and test if you'd like:

https://github.com/mwmitchell/rsolr/blob/master/lib/rsolr/connection.rb#L42

That should do the trick, but it's totally untested at this point.

For the other error, if you look at the update method:

https://github.com/mwmitchell/rsolr/blob/master/lib/rsolr/client.rb#L50

You'll see that update actually has a preset "path" of "update". I
think what you want to use is "post":

solr.post "update/extract", {}

I could adapt the update method to accept a path fragment to build on "update"?
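
To make the arity problem and the path-fragment idea concrete, here is a minimal sketch using a stand-in class. SketchClient and update_with_fragment are hypothetical names invented for this example and are not RSolr's API. The stand-in only assumes what is described above: update has a preset path and takes a single options hash, while post takes the path explicitly.

```ruby
# Stand-in client for illustration only; this is NOT RSolr.
class SketchClient
  # Path is preset to "update", so only an options hash is accepted.
  def update(opts = {})
    post("update", opts)
  end

  # Hypothetical variant: an optional fragment is appended to "update".
  def update_with_fragment(fragment = nil, opts = {})
    path = ["update", fragment].compact.join("/")
    post(path, opts)
  end

  # Returns a description of the request instead of doing any HTTP.
  def post(path, opts = {})
    { :path => path, :opts => opts }
  end
end

client = SketchClient.new

begin
  # Two arguments to a method that accepts at most one.
  client.update("extract", :data => "...")
rescue ArgumentError => e
  puts "update rejected the extra argument (#{e.class})"
end

p client.update_with_fragment("extract", :data => "...")
```

With a signature like update_with_fragment, existing calls that pass no fragment would keep posting to plain "update".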

So if you want to try master, do something like:

solr.post "update/extract", :data => open("data.html"),
  :params => {"literal.id" => 1},
  :headers => {"Content-Type" => "text/html"}

Matt

Dan Young

Dec 9, 2010, 10:48:00 AM

to rs...@googlegroups.com

Matt, thank you....I'll look @ this right now....

Rock on!

Regards,

Dan

Dan Young

Dec 9, 2010, 1:51:36 PM

to rs...@googlegroups.com

Looks like I've got it working...I ended up with this:

r = solr.post 'update/extract', :data => data,
  :params => {
    "literal.aop_study_id" => priorart.studyid,
    "literal.aop_study_title" => priorart.study.title,
    "literal.aop_code" => priorart.study.code,
    "literal.aop_industry" => priorart.study.industry,
    "literal.aop_category" => priorart.study.category,
    "literal.aop_owner" => priorart.study.owner,
    "literal.aop_priorart_id" => priorart.id,
    "literal.aop_priorart_title" => priorart.title,
    "literal.aop_cloudfile_id" => solr_doc.cloud_id,
    "literal.aop_priorart_submission_date" => priorart.created_timestamp,
    "literal.aop_doctype" => priorart.doctype,
    "literal.aop_priorart_author" => priorart.author,
    "literal.aop_priorart_isbn" => priorart.isbn,
    "literal.aop_advisor" => priorart.member.username,
    "fmap.content" => 'text',
    :lowernames => 'true',
    :uprefix => 'ignored_'},
  :headers => {"Content-Type" => solr_doc.content_type, "Content-Length" => solr_doc.content_length}
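
Since every metadata field above repeats the "literal." prefix, a small helper could build that part of the params hash. This is only a sketch: literal_params is a hypothetical helper name, and the field values below are made-up placeholders rather than data from this thread.

```ruby
# Hypothetical helper: prefix each field name with "literal." and merge in
# any other extract parameters (fmap.*, lowernames, uprefix) unchanged.
def literal_params(fields, extra = {})
  literals = fields.each_with_object({}) do |(name, value), out|
    out["literal.#{name}"] = value
  end
  literals.merge(extra)
end

# Placeholder values for illustration only.
params = literal_params(
  { "aop_study_id" => 42, "aop_doctype" => "pdf" },
  { "fmap.content" => "text", "lowernames" => "true", "uprefix" => "ignored_" }
)
p params
```

The resulting hash could then be passed as :params in the solr.post call.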

On a side note, is there a way to dump out the full url that rsolr uses?

Thank you for your help.

Regards,

Dan

Matt Mitchell

Dec 9, 2010, 2:01:12 PM

to rs...@googlegroups.com

Yes, but you'll have to build the request first:

request = solr.build_request('update/extract', :data=>xxx, :params => {})
puts request[:uri].to_s
response = solr.execute(request)

... that *should* work.
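
If it helps to see roughly what such a URL looks like without going through rsolr, the standard library can approximate it. This is a sketch only: preview_url is a hypothetical helper, the base URL is an assumption about the local setup, and the exact string rsolr produces may differ.

```ruby
require "uri"

# Hypothetical helper: join a base URL, a handler path, and query params
# into a single URL string, escaping the parameter values.
def preview_url(base_url, path, params)
  uri = URI.parse("#{base_url.chomp('/')}/#{path}")
  uri.query = URI.encode_www_form(params) unless params.empty?
  uri.to_s
end

url = preview_url(
  "http://localhost:8983/solr",
  "update/extract",
  { "literal.id" => 1, "literal.aop_study_title" => "Prior Art Study" }
)
puts url
```

Note that values containing spaces, such as titles, are escaped in the query string.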

Matt

Dan Young

Dec 9, 2010, 2:16:07 PM

to rs...@googlegroups.com

Ok great.....I'll give that a whirl too...

Dano
