Patch: improving indexing-performance of MVA through a separate query

90 views
Skip to first unread message

Menno van der Sman

unread,
Feb 12, 2009, 11:11:27 AM2/12/09
to Thinking Sphinx
Hi there,

I've run into some scaling problems in the way the indexer handles
long lists of multi-valued-attributes. Worst case scenario I have
items with over 25000 attributes attached. Indexing these through a
left-join with group_concat took a long time and caused quite some
load on the database.

Reading up on the sphinx-documentation I found that multi-valued-
attributes could also be indexed through a separate query that simply
retrieves all the <document, attribute>-pairs. A quick test showed
that this speeds up the indexing tremendously.

This feature isn't supported by thinking-sphinx so I took a stab at it
in my fork at http://github.com/menno/thinking-sphinx/commits/mva

It's tested in production for my use case which is along the line of
Item.has_many :tags, :through => :taggings. For which it can "select
item_id, tag_id from taggings" to get all the pairs. There are specs
and code for other has-many-associations but they, and other cases,
haven't been thoroughly tested.

Another point of concern is that I needed access to the unique-id-
expression used in the select-query to match up the ids. I've moved
this logic to ThinkingSphinx.unique_id_expression(offset) but I still
needed to pass around the offset a lot more than I'd like.

So I hope this can be of use to anyone, and feel free to comment on
the implementation/tests as it's my first encounter with the internals
of thinking-sphinx, cucumber and rspec ;)

Cheers,

Menno van der Sman

Pat Allan

unread,
Feb 16, 2009, 2:41:43 AM2/16/09
to thinkin...@googlegroups.com
Hi Menno

Thanks for the patch - what you've done is great, and I'll definitely
roll it into my master when I have the chance (although not entirely
sure when that'll be). It's great to know that it provides some
significant speed-ups too.

Cheers

--
Pat

Piotr Sarnacki

unread,
Feb 28, 2009, 9:10:48 AM2/28/09
to Thinking Sphinx
I have a question about using :souce => :query, :type => :multi. I've
changed my model to use it with all multi values and indexing takes
much less time but I'm wondering if there are any drawbacks? Are there
negative sides of using sql_attr_multi with source query?

Menno: thanks for your patch :)

On Feb 16, 8:41 am, Pat Allan <p...@freelancing-gods.com> wrote:
> Hi Menno
>
> Thanks for the patch - what you've done is great, and I'll definitely  
> roll it into my master when I have the chance (although not entirely  
> sure when that'll be). It's great to know that it provides some  
> significant speed-ups too.
>
> Cheers
>
> --
> Pat
>
> On 13/02/2009, at 3:11 AM, Menno van der Sman wrote:
>
>
>
> > Hi there,
>
> > I've run into some scaling problems in the way the indexer handles
> > long lists of multi-valued-attributes. Worst case scenario I have
> > items with over 25000 attributes attached. Indexing these through a
> > left-join with group_concat took a long time and caused quite some
> > load on the database.
>
> > Reading up on the sphinx-documentation I found that multi-valued-
> > attributes could also be indexed through a separate query that simply
> > retrieves all the <document, attribute>-pairs. A quick test showed
> > that this speeds up the indexing tremendously.
>
> > This feature isn't supported by thinking-sphinx so I took a stab at it
> > in my fork athttp://github.com/menno/thinking-sphinx/commits/mva

Pat Allan

unread,
Mar 3, 2009, 3:23:56 AM3/3/09
to thinkin...@googlegroups.com
Hi Piotr

As far as I can tell, there's no drawbacks with generating MVA results
via a source query. Will try to look at the patches in the next few
days.

Cheers

--
Pat

Menno van der Sman

unread,
Mar 10, 2009, 6:22:51 AM3/10/09
to Thinking Sphinx
Hi Pat,

Thanks for merging the patches. Some nice refactoring going on as
well ;)

I also noticed the commit that allows the custom query for MVAs. I
thought of the same when doing the initial performance testing but I
abandoned it because, well, it doesn't work. The foreign-key needs to
be offsetted to match the unique-id for the model that's being
indexed. I've put up a simple scenario at
http://github.com/menno/thinking-sphinx/commit/edce4901ed27afed4a57f05181753dd2d1738c07
to show that it fails.

As far as fixing it; this would require some nasty detecting of the
foreign-key and replacing it with the unique-id-expression.

Cheers,

Menno

Pat Allan

unread,
Mar 10, 2009, 9:04:45 AM3/10/09
to thinkin...@googlegroups.com
Hi Menno

You're right, it's not perfect - but I figure anyone who wants to
specify a manual query should be ready to add the offset. I don't
think it's worth even attempting figuring out what the foreign key is
- this is an edge case, and people should be ready to work for these
things if they're going outside the bounds of the vast majority of
ActiveRecord usage.

That said, if anyone wants to patch it with a neat trick to help with
an offset, be my guest :)

--
Pat
Reply all
Reply to author
Forward
0 new messages