In-memory Ruby full text search


Mikel Lindsaar

Jan 2, 2014, 6:27:10 PM
to rails-...@googlegroups.com
Hi there fellow Railers,

I have a requirement for full text search on a small static set of documents in 8 languages. The documents are committed to the repository and don't change outside of a deploy. We're talking about 2 MB total, inclusive of all languages, across about 20 HTML documents per language.

I want to get full text search happening on these documents, with weighted results and simple AND / OR type matching.

Obviously, using something like Sphinx / Postgres full text search would handle it, but it feels like overkill to spin up a separate search server instance to manage and index.

Using something like Xapian is an option, and I could build the index on app boot, but it needs packages installed on the server (I'm trying to do this simply).

Anyone know of a Xapian-like Ruby full text search that can run in memory or off temp files and doesn't have external dependencies?

I've done some googling and can't find anything that really suits. I think the simplest thing might be building PostgreSQL full text search tables on app boot and using them as the app is already using PostgreSQL.

But I welcome other ideas if they exist :)

Mikel

Chris Berkhout

Jan 2, 2014, 6:54:35 PM
to rails-...@googlegroups.com

Check out picky:
http://florianhanke.com/picky/

--
You received this message because you are subscribed to the Google Groups "Ruby or Rails Oceania" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rails-oceani...@googlegroups.com.
To post to this group, send email to rails-...@googlegroups.com.
Visit this group at http://groups.google.com/group/rails-oceania.
For more options, visit https://groups.google.com/groups/opt_out.

Chris Berkhout

Jan 2, 2014, 7:01:35 PM
to rails-...@googlegroups.com

Actually, if it's just text and no categories or anything, Picky may not suit. I believe it does let you create your own indexes easily and choose the back end (say, in memory), but it is aimed at categorized data.

Lightweight indexing might be an easier task than finding something lightweight for stemming in those 8 languages.
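
For a sense of how little code lightweight indexing needs, here is a minimal sketch of an in-process inverted index in plain Ruby, with weighted results and simple AND / OR matching. Everything in it is made up for illustration (the `TinyIndex` class, the field weights, the document ids). It does no stemming, and the tokenizer assumes languages that separate words with spaces or punctuation.

```ruby
# Minimal in-process inverted index: a sketch, not a replacement for a real
# search library. No stemming; tokens are lowercased runs of letters/digits.
class TinyIndex
  # Hypothetical field weights: a hit in the title counts more than one in the body.
  WEIGHTS = { title: 3.0, body: 1.0 }.freeze

  def initialize
    # term => { doc_id => accumulated weight }
    @postings = Hash.new { |hash, term| hash[term] = Hash.new(0.0) }
  end

  # fields is a hash such as { title: "...", body: "..." }.
  def add(doc_id, fields)
    fields.each do |field, text|
      weight = WEIGHTS.fetch(field, 1.0)
      tokenize(text).each { |term| @postings[term][doc_id] += weight }
    end
  end

  # mode is :and (every term must match) or :or (any term may match).
  # Returns [doc_id, score] pairs, best score first.
  def search(query, mode: :and)
    terms = tokenize(query).uniq
    return [] if terms.empty?

    # key? avoids creating empty postings for terms that were never indexed.
    hits = terms.map { |term| @postings.key?(term) ? @postings[term] : {} }
    ids = hits.map(&:keys)
    doc_ids = mode == :and ? ids.reduce(:&) : ids.reduce(:|)

    doc_ids
      .map { |id| [id, hits.sum { |posting| posting.fetch(id, 0.0) }] }
      .sort_by { |id, score| [-score, id.to_s] }
  end

  private

  def tokenize(text)
    text.downcase.scan(/[[:alnum:]]+/)
  end
end

index = TinyIndex.new
index.add("en/install", title: "Installing the app", body: "Download and run the installer.")
index.add("en/faq", title: "FAQ", body: "Installing takes a minute. Run the app afterwards.")

index.search("installing app")
# => [["en/install", 6.0], ["en/faq", 2.0]]
index.search("installer faq", mode: :or)
# => [["en/faq", 3.0], ["en/install", 1.0]]
```

At 2 MB of source text an index like this could be rebuilt on every boot. The hard part is still stemming: "installing" and "installer" stay separate terms here.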

Andrew Harvey

Jan 2, 2014, 7:19:24 PM
to rails-...@googlegroups.com

Way back in the day I used Ferret, which is a pure Ruby implementation of Lucene. I moved away from it because scaling was a real problem. It may be a good tool for the job, though I'd be surprised if it were still maintained.

I don't know of any other options, other than maybe JRuby/Lucene.

A.

Ben Hoskings

Jan 2, 2014, 10:30:11 PM
to rails-...@googlegroups.com
My first stop would be to load them into a postgres DB in a deploy task, and then search the data from there.

Also, I wouldn't worry about the 'in memory' part; the kernel will take care of keeping the data in memory via the page cache as long as the box has enough memory.

- Ben

Mikel Lindsaar

Jan 3, 2014, 1:32:09 AM
to rails-...@googlegroups.com
On 3 Jan 2014, at 2:30 pm, Ben Hoskings <b...@hoskings.net> wrote:
> My first stop would be to load them into a postgres DB in a deploy task, and then search the data from there.

Yeah, as mentioned, that's looking like the "simplest" option.

> Also, I wouldn't worry about the 'in memory' part; the kernel will take care of keeping the data in memory via the page cache as long as the box has enough memory.

Yeah, I'm not worried about it being "in memory", I was more talking about "in memory" as a way of saying "not a client server model"... perhaps "in process" would have been a better way of saying it :)

Thanks :)

Mikel

Chris Ixion

Jan 3, 2014, 9:25:00 AM
to rails-...@googlegroups.com
Elasticsearch is pretty awesome for this sort of stuff, but it introduces dependencies, so possibly no help.

http://www.elasticsearch.org/overview/

Cheers,

Chris

Ben Schwarz

Jan 3, 2014, 7:09:46 PM
to rails-...@googlegroups.com
Interesting problem… Where are the documents being stored? Would there be benefit in the documents being served via an API once they've been "found"? 
Perhaps couchdb or some other document store with an index running over the top (lucene for couch) could be a good fit? 

You're going to want something pretty robust / well utilised to get solid word stemming / language support, so you'd want to utilise a bona fide search service. (Not Rubby)
Also, you mentioned that the content is HTML; can it be stored in Markdown or something with less… structure? There could be challenges here too!
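
If the documents have to stay as HTML, one option is to strip the markup before indexing. This is a rough sketch using only the Ruby standard library. The `html_to_text` helper is a made-up name, and it assumes well-formed, trusted documents; a regex is not an HTML parser. `CGI.unescapeHTML` only decodes a handful of named entities plus numeric ones, so something like `&nbsp;` would pass through untouched.

```ruby
require "cgi"

# Rough HTML-to-text conversion for indexing only. Assumes well-formed,
# trusted documents from our own repository; not safe for arbitrary input.
def html_to_text(html)
  text = html.gsub(%r{<(script|style)\b[^>]*>.*?</\1>}mi, " ") # drop non-content blocks
  text = text.gsub(/<!--.*?-->/m, " ")                          # drop comments
  text = text.gsub(/<[^>]+>/, " ")                              # drop remaining tags
  CGI.unescapeHTML(text).gsub(/\s+/, " ").strip                 # decode entities, squeeze whitespace
end

html_to_text("<h1>Install &amp; run</h1>\n<p>Download the <b>installer</b> first.</p>")
# => "Install & run Download the installer first."
```

Note that this throws away the heading structure, so anything that weights titles above body text has to pull the titles out before this step.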


Julio Cesar Ody

Jan 3, 2014, 7:32:25 PM
to rails-...@googlegroups.com
Andrew mentioned Ferret before. It crossed my mind first thing, then I went to check on the GitHub repo (which is just an import from the old Trac repo), and it hasn’t been touched for years now.

It had some concurrency issues, which acts_as_ferret failed to address. Since that was the plugin that popularised the lib, the lib got a bad rep. Ferret itself was beyond rad. I wrote a search engine for a startup I used to work at back in the day using it, which was a non-Rails thing. Worked like a charm.

The author was a super smart guy (and nice) from Straaya. We tried to hire him then, but he was about to move to Japan to further his Judo studies. Never heard from him again.

(Cool story, bro)

Mikel Lindsaar

Jan 12, 2014, 1:42:42 AM
to rails-...@googlegroups.com
Thanks all for the replies :)

Ben, they are documents from a translation system, about 40 documents in 9 (soon to be 16+) languages. Once done, they really don't change much, and building a big external search system for 40 small documents seemed to be engineering overkill. The search is done via the web app itself.

The solution I ended up implementing was a rake task that uploads the static set of documents into a small PostgreSQL table, with Postgres full text search via textacular. I might improve this in the future, but for now it works really well.

It added no further dependencies and works really well and really fast.

Works well :)
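
For anyone curious what a ranked Postgres full text query over a table like that can look like, here is a sketch that writes the SQL by hand. None of it is the actual schema or textacular's generated output: the `documents` table, its `locale` / `title` / `body` columns, the `TS_CONFIGS` mapping and the `search_sql` helper are all assumptions for illustration. `plainto_tsquery` ANDs the search terms together.

```ruby
# Hypothetical mapping from locale codes to PostgreSQL's built-in text search
# configurations; unknown locales fall back to "simple" (no stemming).
TS_CONFIGS = {
  "en" => "english", "fr" => "french", "de" => "german", "es" => "spanish"
}.freeze

# Builds a ranked full text query for a hypothetical `documents` table.
# Title matches get weight A, body matches weight B. Returns [sql, params]
# for a parameterised call such as PG::Connection#exec_params.
def search_sql(locale, query)
  # Safe to interpolate: config only ever comes from the whitelist above.
  config = TS_CONFIGS.fetch(locale, "simple")
  sql = <<~SQL
    SELECT id, title,
           ts_rank(
             setweight(to_tsvector('#{config}', title), 'A') ||
             setweight(to_tsvector('#{config}', body), 'B'),
             plainto_tsquery('#{config}', $1)
           ) AS rank
    FROM documents
    WHERE locale = $2
      AND (to_tsvector('#{config}', title) || to_tsvector('#{config}', body))
          @@ plainto_tsquery('#{config}', $1)
    ORDER BY rank DESC
  SQL
  [sql, [query, locale]]
end

sql, params = search_sql("fr", "installer application")
```

With only a few dozen rows there is no real need for a stored tsvector column or a GIN index; computing the vectors at query time keeps the table trivial to rebuild on deploy.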

Mikel