Solr content extraction ("Cell")

Nick Zadrozny

Sep 20, 2010, 2:50:33 PM
to ruby-s...@googlegroups.com
Hey all,

So with the impending release of Sunspot 1.2 and the resurrected interest in Solr Cell integration, I thought I'd start a fresh email thread on the subject. I, too, would like to see Solr Cell support added to the Sunspot DSL, and I've been brainstorming with Mat a bit on IRC to figure out how to do that appropriately while still being able to efficiently provide a bundled Solr server that supports Sunspot.

There is already a branch or two out there with some level of Cell support added to the Sunspot DSL. Notably http://github.com/isaac/sunspot/tree/cell — which I happen to know is currently being used by some in production.

Once Sunspot 1.2 final is pushed, I'd love to see this branch get updated and merged into master for consideration for 1.3. Toward that end, I've started my own cell branch by rebasing Isaac's against the latest Sunspot master. I'll spend some time poking around, getting specs working, fixing regressions, that sort of thing, then trimming out any unnecessary changes from master.

If anyone (Isaac?) wants to help with that, my branch is here:

I'm looking for patches that…
1. Fix broken specs and regressions
2. Pare down unnecessary changes from master

Find me in IRC (nzadrozny in #sunspot-ruby) if you are interested in seeing this get done.

--
Nick Zadrozny

Sam

Sep 20, 2010, 6:18:26 PM
to Sunspot
Awesome news, Nick, thanks. I have forked the project and I will take a
deeper look into it when I have a chance. A couple of questions,
though:

I assume you are working on the Cell branch?
Is it working (but an older version of sunspot) or not working but up
to date with sunspot?

Thanks

Nick Zadrozny

Sep 20, 2010, 6:30:49 PM
to ruby-s...@googlegroups.com
On Mon, Sep 20, 2010 at 3:18 PM, Sam <ooto...@gmail.com> wrote:
> I assume you are working on the Cell branch?

Yep.

> Is it working (but an older version of sunspot) or not working but up
> to date with sunspot?

Isaac's is the former, mine is the latter — up to date, but not working. At least, the specs aren't running. There's some amount of (hopefully superficial) brokenness from updating to the latest Sunspot, which I'll look into some time in the next few days.

--
Nick Zadrozny

Sam

Sep 21, 2010, 11:17:17 PM
to Sunspot
So, I am using Solr 1.4.1 installed with Homebrew on OS X 10.6 and
also using Solr 1.4.0+ds1-1ubuntu1 on Ubuntu 10.04.

I am trying to narrow down where these systems are having problems in
indexing the content of attachments so I have been trying to test the
Solr installs independent of Sunspot to see if I can get it to index
files. So far when I use the command:

curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F "myfile=@ATestDoc.pdf"

I get a "lazy loading error" on both systems. Does this mean I am
missing something in my Solr installs?

The file exists in the directory where I make the call, a Solr process
is running (on both systems), and I've checked the ports. The
Sunspot search also works on the site I am building (minus the file
content search).

I figure either the above curl command is an incorrect way to test Solr
Cell or both installations of Solr Cell are missing a required
extension or something (I think Cell comes with the Solr 1.4 install
but I'm not 100% sure on this). Any help would be appreciated.
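
For anyone who would rather run the same check from Ruby, here is a minimal sketch that builds the equivalent request with Net::HTTP. The URL, document id, and query parameters come from the curl command above; the method names build_extract_request and post_to_solr are made up for illustration. One difference from curl -F: this sends the file as a raw request body with a Content-Type header instead of a multipart form field, on the assumption that the extract handler accepts either. It only reproduces the request, so it should not be expected to behave any differently from curl if the handler itself fails to load.

```ruby
require "net/http"
require "uri"

# Build (but do not send) a POST to Solr's extract handler, mirroring the
# curl test. The default id and base URL are taken from that command.
def build_extract_request(file_path, id: "doc1", base: "http://localhost:8983/solr")
  uri = URI("#{base}/update/extract")
  uri.query = URI.encode_www_form("literal.id" => id, "commit" => "true")

  request = Net::HTTP::Post.new(uri)
  # Assumption: the file is a PDF and is sent as the raw request body.
  request.body = File.binread(file_path)
  request["Content-Type"] = "application/pdf"
  [uri, request]
end

# Send the request and return the Net::HTTP response object.
def post_to_solr(uri, request)
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end
```

Usage would be something like `uri, req = build_extract_request("ATestDoc.pdf")` followed by `puts post_to_solr(uri, req).body` to see whatever Solr sends back.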

Isaac

Sep 20, 2010, 9:24:54 PM
to Sunspot
Hey Guys,

My cell branch was a bit of a mess so I have cherry-picked the
necessary commits and reapplied them on top of outoftime/sunspot
master. You can find all the changes on my master branch here:
http://github.com/isaac/sunspot

All the attachment specs now pass for me - note that you need to
follow the instructions from Mat about the Solr Cell .jar and its
dependencies here: http://outoftime.lighthouseapp.com/projects/20339/tickets/98-solr-cell

Cheers,
Isaac

Erwin Matthijssen

Oct 24, 2010, 7:28:01 AM
to Sunspot
Hiya,

I've been trying to integrate sunspot (Isaac's fork) into my app and
index pdf files. Indexing of db fields works as expected, but trying
to index attachments creates trouble.

I asked Isaac for some guidance and he directed me here, saying Nick
could probably help me out.

My setup: the app runs on Heroku, files are stored on S3 (handled by
Paperclip), and the index is on Websolr.

So in my model I'm doing this:

class Document < ActiveRecord::Base
  belongs_to :documentable, :polymorphic => true

  has_attached_file :content,
    :storage => :s3,
    :s3_credentials => "#{Rails.root}/config/s3.yml",
    :path => "documents/:id/:basename.:extension"

  # Only run Paperclip post-processing when image? returns true
  # (the image? method is not shown here).
  before_post_process :image?

  searchable do
    text :name, :default_boost => 2, :stored => true
    attachment :content
  end
end

However, when I try to post an actual attachment to the document model
things go wrong.

In my local development environment (OS X) this results in a
lazy_loading error from Solr. Now this might very well be a local
environment issue (although I've followed all the posts and tutorials
on setting it up).

In production (on Heroku/Websolr) I get a different error, and I've
included a link to the trace at the bottom.

I can't figure out why this isn't working. Can you give any insight into
what I am doing wrong here?

Hope you can point me in the right direction.

Kind regards,
Erwin

Gist with error trace from Heroku: http://gist.github.com/643442

clyfe

Nov 3, 2010, 10:42:40 AM
to Sunspot
Hello Erwin Matthijssen,

I believe the issue is that the Cell patch only works if the files are
accessible on the filesystem local to the Solr server. If you use S3
it will not work. I am currently pondering the same issue.

Erwin Matthijssen

Nov 3, 2010, 4:47:05 PM
to ruby-s...@googlegroups.com
Hi Clyfe,

I can happily report that the Cell patch *does* work with files that
are stored on S3. I've got that working in our local development
environment now (files on S3, Solr local).

I'm still having trouble transplanting that solution to Heroku/Websolr
though, which is odd since the Cell-patched Sunspot takes care of
posting the file contents to the Solr server. This means Websolr
doesn't even know (or care) where the files are stored.

The tinkering continues...

Erwin Matthijssen

Nov 3, 2010, 6:10:54 PM
to ruby-s...@googlegroups.com
Success at last!

I can verify that Isaac's cell branch will work "out-of-the-box" with
Heroku and Websolr. The last part of the puzzle was updating the
schema at Websolr with the one provided with the cell branch (I'm a
Solr noob, so this wasn't obvious to me).

Just make sure you pass the (authorized) URL to the file on S3 to the
attachment keyword in your searchable declaration:

searchable do
  attachment :attached_file
end

def attached_file
  # Insert code here that generates an (authorized) URL to your file
  # and returns it as a string.
end
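
As a rough illustration of what attached_file could return, here is a minimal sketch. It assumes the attachment is a Paperclip attachment on S3 and that your Paperclip version provides expiring_url for S3 storage, which returns a time-limited signed URL. FakeS3Attachment and the example bucket name are hypothetical stand-ins so the snippet runs without Rails or AWS credentials; in a real model, content would be the Paperclip attachment itself.

```ruby
# Hypothetical stand-in for a Paperclip S3 attachment, for illustration only.
# It fakes expiring_url so the sketch runs without Rails or AWS.
class FakeS3Attachment
  def initialize(path)
    @path = path
  end

  # Returns a made-up URL that merely looks like a signed S3 link.
  def expiring_url(seconds = 3600)
    expires_at = Time.now.to_i + seconds
    "https://example-bucket.s3.amazonaws.com/#{@path}?Expires=#{expires_at}"
  end
end

# Plain-Ruby stand-in for the model; not an ActiveRecord class.
class Document
  attr_reader :content

  def initialize(content)
    @content = content
  end

  # The method named in the searchable block: returns the URL as a String.
  # The 300-second lifetime is an arbitrary choice for this sketch.
  def attached_file
    content.expiring_url(300).to_s
  end
end
```

The URL only needs to stay valid long enough for the file to be fetched at index time, so a short expiry seems reasonable.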

Sometimes letting something rest for a week or so works wonders!


clyfe

Nov 4, 2010, 1:15:25 PM
to Sunspot, Erwin Matthijssen
Thank you for the feedback; I got this working.