Thinking Sphinx with multiple databases

Spike

Nov 5, 2009, 9:02:17 PM
to Thinking Sphinx
Hi there, from my code hunting and general searching I've pretty much
determined that thinking sphinx doesn't support indexing the same
models across a series of databases.

I have a CMS app and each client's data is stored in a separate
database for many reasons. When the rails app gets a connection it
looks at the hostname requested and switches databases accordingly. It
all works quite well except when it comes to search. I managed to get
Ferret to work with multiple databases by adding another key which was
the database name to the ferret index. I also had someone hack
ultrasphinx so that it would encode the database in the upper digits
of the sphinx index. I've decided to make the switch to thinking
sphinx because the setup for ultrasphinx is both complicated and fails
quite a lot (well the indexer does anyway).

Can someone point me in the right direction with regards to modifying
thinking sphinx to work with multiple databases? From what I can tell
most of the changes would be in the configuration.rb file where code
would have to be added to loop both databases and models and index
them appropriately.

Any tips would be greatly appreciated, and I would seriously consider
paying to get it added as an official feature of thinking sphinx.

Looking forward to hearing from you :)

Brendon

Pat Allan

Nov 5, 2009, 9:59:29 PM
to thinkin...@googlegroups.com
Hi Brendon

Sounds like you want multiple sources - which, at a basic level, is in
Thinking Sphinx (though it's not publicised):
define_index do
  define_source do
    # usual field and attribute stuff here
  end

  define_source do
    # other fields and attributes
  end
end

I guess if you can use some metaprogramming to generate a source for
each client, then that would be a start... but there are complications
with Sphinx expecting a unique id per document across the entire
setup, so you'd need to customise the offset that TS generates.
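To make the metaprogramming idea a bit more concrete, here's a minimal plain-Ruby sketch (no Thinking Sphinx required) of turning a list of per-client database settings into one source definition each. The client names, settings hash, and the sources_for helper are all invented for illustration; in a real define_index block, each entry would become a define_source call pointing at that client's database:

```ruby
# Hypothetical per-client database settings - purely illustrative names.
CLIENT_DATABASES = {
  "school_one" => { host: "db1.internal", database: "cms_school_one" },
  "school_two" => { host: "db2.internal", database: "cms_school_two" }
}

# Build one source description per client database. In practice each
# of these would drive a define_source block inside define_index.
def sources_for(databases)
  databases.map do |name, settings|
    {
      name:     "page_#{name}",   # source names must be unique
      sql_host: settings[:host],
      sql_db:   settings[:database]
    }
  end
end

sources_for(CLIENT_DATABASES).each { |source| puts source[:name] }
```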

Hopefully that gives you a headstart.

As for something official in Thinking Sphinx... I've only heard of
such a request once before, so I'm not sure if it's worth paying for.
If you insist, then I'll consider it :)
That said, given this is Thinking Sphinx month, I'm not running out of
bugs to fix and features to add.

--
Pat

Mike Gunderloy

Nov 6, 2009, 2:48:01 AM
to thinkin...@googlegroups.com
We had a similar requirement on one project. We ended up tackling it
on two fronts. First, we ended up with a hacked-up configuration.rb
inside of Thinking Sphinx that uses an internal MultiSite object from
our code, which returns a string representing the current site. Here's
part of it:

@configuration.searchd.pid_file = "#{self.app_root}/log/searchd.#{environment}.#{MultiSite.site}.pid"
@configuration.searchd.log = "#{self.app_root}/log/searchd.#{MultiSite.site}.log"
@configuration.searchd.query_log = "#{self.app_root}/log/searchd.query.#{MultiSite.site}.log"

You get the idea: all the configuration stuff is namespaced by site
name. Then we run one indexing daemon per site, so we actually have
completely separate indexes for each of the sites (all of which have
the exact same database schema, but different database instances, an
architecture necessitated by some fairly strict business rules about
what we can physically share between customers). Then, when an
incoming request comes in that needs to use Sphinx for searching, we
play games in the controller code to hook up to the right index:

ThinkingSphinx::Configuration.instance.port = MultiSite.prop(:searchd_port)

That points TS at an instance of sphinx whose port is keyed off the
MultiSite object, and then it's off to the races to make .search calls
to find data.
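If hard-coding a port per site feels fragile, one option is deriving each site's port from its position in a site list. Here's a rough plain-Ruby sketch of that idea; the base port, site names, and searchd_port_for helper are all assumptions for illustration (Mike's MultiSite.prop would be the real lookup):

```ruby
# Hypothetical scheme: each site's searchd listens on BASE_PORT plus
# its index in the site list. Constants are invented for illustration.
BASE_PORT = 9312
SITES     = %w[school_one school_two school_three]

def searchd_port_for(site)
  BASE_PORT + SITES.index(site)
end

# In a controller before_filter you'd then do something like:
#   ThinkingSphinx::Configuration.instance.port = searchd_port_for(current_site)
puts searchd_port_for("school_two")
```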

Hope that helps, or at least gives you an architecture to start from.

Mike

Spike

Nov 9, 2009, 4:46:25 AM
to Thinking Sphinx
Hehe, thanks Pat :)

The source thing sounds good. Our ultrasphinx mod used bitshifting in
the upper range of the index for each record to identify which
database the record belonged to. I'm not sure if thinking sphinx does
a similar thing with regards to encoding the class of an object in the
index id? I'm also not sure of the details of how it was done in US
since I didn't do it, but maybe I'll learn something by
investigating :D

What approximate cost would you assign to a modification of this type?
Us poor New Zealanders are at the mercy of harsh currency exchange
rates, but I'd be interested to know nonetheless :D We have about 100
schools using our CMS system, so the benefit to them (and to us) of a
non-crashing, actually-working search engine is pretty obvious :D

Cheers for your help :)

Brendon

Spike

Nov 9, 2009, 4:51:32 AM
to Thinking Sphinx
Thanks Mike,

Sounds like a good idea, but I'm a bit concerned because of the number
of search instances that we'd be required to run. As I said in the
above reply we have about 100 schools, so that would mean having to
run 100 search daemons? I'm not sure of the details of how these
daemons work when they're not doing anything, but I'd presume this
would be a slight tax on the system?

I'd be interested to hear how many search instances you were
running. :)

Cheers,

Brendon

Pat Allan

Nov 11, 2009, 11:17:01 PM
to thinkin...@googlegroups.com
Hi Brendon

I'm not sure what Sphinx's base memory usage is with no documents
(it's worth investigating), but I know that all attribute values for
all documents are stored in memory - and so, the only difference
between one instance and ten is that base memory usage - you'll still
have the documents' attributes (split up over multiple instances or
all in one) in memory anyway.

--
Pat

Pat Allan

Nov 11, 2009, 11:19:56 PM
to thinkin...@googlegroups.com
I'm also not entirely sure how US did it - although I may have stolen
their approach for TS. It's been a while. Thinking Sphinx calculates
an offset on the primary key, based on the number of indexed models,
and the array index of current model within that array of all indexed
models, to ensure each document id is unique. If you look at the
generated SQL in your sphinx.conf files, this should be reasonably
clear. I guess you could factor an application id into the mix as
well.
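To make the offset scheme concrete, here's a tiny sketch of the calculation as described above (primary key times model count, plus the model's position in the list of indexed models). Treat it as an illustration of the idea rather than TS's exact generated SQL:

```ruby
# Document ids stay unique across models because each model's ids
# land in their own residue class modulo the model count.
def document_id(primary_key, model_index, model_count)
  primary_key * model_count + model_index
end

# With three indexed models:
puts document_id(1, 0, 3)  # => 3 (first model, record 1)
puts document_id(1, 1, 3)  # => 4 (second model, record 1 - no collision)
puts document_id(2, 0, 3)  # => 6 (first model, record 2)
```

Folding an application id in would just mean adding another term to the formula, at the cost of shrinking the usable primary-key range further.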

As for costs, I'm not sure - I'm in Australia though, so hopefully the
exchange rate doesn't jump around too much. Either way, if you'd like
me to sketch out a rough timeline and cost, let's continue this
discussion offlist :)

Cheers

--
Pat

Brendon Muir

Nov 25, 2009, 6:59:55 PM
to thinkin...@googlegroups.com
Thanks Pat :) Sorry for the slow reply.

Probably the biggest nightmare would be handling all the processes. I'm sure I could script it all up but it'd still leave me feeling uneasy. Like I used to feel with mongrel clusters :D

Hopefully a one-process system will be possible :)

Cheers,

Brendon

--~--~---------~--~----~------------~-------~--~----~
 This message is part of the topic "Thinking Sphinx with multiple
databases" in the Google Group "Thinking Sphinx" for which you requested
email updates.
To stop receiving email updates for this topic, please visit the topic
at http://groups.google.com/group/thinking-sphinx/t/beb3931140893547
-~----------~----~----~----~------~----~------~--~---

