Slow indexing (never finishes) when indexing 4 or more associations

287 views
Skip to first unread message

Steve Kenworthy

unread,
Jul 25, 2013, 12:54:08 AM7/25/13
to thinkin...@googlegroups.com
Hi there,

Firstly, thinking-sphinx is awesome and I love it. Thanks Pat for an excellent project. V3 is looking great and represents a lot of hard work and effort.

I've been using thinking-sphinx to index a document model and it's really slowed down when I add lots of associations in the index. In fact, it never finishes on my machine (8Gig RAM, 8 CPU's) when I add 4 indexes.

Times:
  • 4 seconds - when 1 association (images) is indexed
  • 6 seconds - when 2 associations (images and subscribers) are indexed
  • 23 seconds - when 2 associations (images and countries) are indexed
  • 115 seconds - when 3 associations (images, subscribers and tags) are indexed
  • 113 seconds - when 3 associations (images, subscribers and videos) are indexed (just to prove it's not tags slowing it down)
  • ꝏ (not finishing) - when 4 associations or more are selected.

Here's my index file:

ThinkingSphinx::Index.define :document, with: :active_record, delta: true, sql_range_step: 999999999, group_concat_max_len: 16384 do

  has countries(:id), as: :country_ids
  has images(:id), as: :image_ids, facet: true
  has subscribers(:id), as: :subscriber_ids, facet: true
  has tags(:id), as: :tag_ids, facet: true
  has videos(:id), as: :video_ids, facet: true

  indexes countries.name, as: :countries
  indexes images.title, as: :images
  indexes subscribers.title, as: :subscribers
  indexes tags.name, as: :tags
  indexes videos.title, as: :videos

  has updated_at

end

The generated sql is a massive group_by query and is not finishing. See it here https://github.com/crossroads/rails3-ts-example#what-sphinx-is-doing

I'd really appreciate some advice on how to optimise this so indexing becomes viable again. Do I just have too much going on here? I'm using facets, indexes and attributes. Perhaps there is a better way to optimise? A friend suggested pre-computing with some joins... how would this work?

Vital stats: using mysql v14.14, sphinx 2.0.4, Ubuntu, rails 3.2.13, thinking-sphinx 3.0.4

For those who'd like to take a look, I've uploaded a sample project here https://github.com/crossroads/rails3-ts-example which can be cloned. If you follow the instructions, it will setup a db with test data and reproduce the problem quickly.

There's also the sphinx generated SQL and EXPLAIN: https://github.com/crossroads/rails3-ts-example#what-sphinx-is-doing 

Thanks in advance for anyone taking the time to read.

Regards,
Steve

Pat Allan

unread,
Jul 25, 2013, 10:14:01 AM7/25/13
to thinkin...@googlegroups.com
Hi Steve

I've got a way forward to greatly improve the speed of indexing… unfortunately, it's not going to work within Thinking Sphinx easily right now.

Sphinx has the ability to gather attribute and field values from separate queries - this existed for TS v1/v2 for attributes, and fields was added in TS v3, but the catch is those separate queries don't work for HABTM joins. I'd love to change that, it's just painful from an ActiveRecord perspective because you're not dealing with a model's table as the base, but the HABTM join table.

Here's the configuration for the relevant source that I modified by hand:
https://gist.github.com/pat/6080031

You'll see that the main query is nice and short - and then there's each of the MVA and joined field definitions. If you put this in the generated source definition in config/development.sphinx.conf, and then run the indexer manually (NOT through the rake task, that'll overwrite this):
indexer --config config/development.sphinx.conf --all --rotate

(Remove --rotate if Sphinx isn't running.) You'll see it's pretty damn fast.

Now, ways forward? Well, I'd love to write something for TS v3 that can handle HABTM - it's just a shame that it might need to be pure ARel rather than ActiveRecord-built (which can otherwise help with joins).

But otherwise: switch from HABTM to has_many/has_many :through - make each of the joins an actual model. Then, you can add :source => :query to each of the appropriate field and attribute definitions, and it should generate something pretty much the same.

Hope this provides some clarity at the very least! And also: thanks for the test app, really helped with debugging!

--
Pat
> --
> You received this message because you are subscribed to the Google Groups "Thinking Sphinx" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to thinking-sphi...@googlegroups.com.
> To post to this group, send email to thinkin...@googlegroups.com.
> Visit this group at http://groups.google.com/group/thinking-sphinx.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>


Pat Allan

unread,
Jul 26, 2013, 8:57:38 AM7/26/13
to thinkin...@googlegroups.com
Heya Steve

Was just looking into how difficult this would be to implement properly, and noticed I have added the ability to take a string as the source query - instead of the column references. So, it's possible without hacking around in the index definition itself:

https://gist.github.com/pat/6088629

It's worth noting that the document id (Sphinx's equivalent of a primary key) involves the normal primary key with an offset and a multiplier. Make sure those two integers match what's in your generated index in sql_query. They may change when you add other indices to your app (depends on alphabetical order of your index files).

Also: there's probably some metaprogramming you could add to simplify things a bit more.

Would love to hear if this approach helps with your real app and not just the test one :)

--
Pat

Steve Kenworthy

unread,
Jul 26, 2013, 1:03:00 PM7/26/13
to thinkin...@googlegroups.com
Thanks Pat,

Two very useful and insightful replies. I sat down to analyse what you wrote just after you sent the second post and it's definitely worked very well so far. Indexing is much faster with the string source query, and I've worked out the correct multipliers (thanks for the tip).

I'll finish off the indexes and test the app to confirm but so far so good.

Thanks for that!
Regards,
Steve

jons...@gmail.com

unread,
Jun 28, 2015, 9:50:18 AM6/28/15
to thinkin...@googlegroups.com
Hi Pat,

I implemented according to this, and the indexing time went down (5 times faster on development). However, the delta indexing time went up (30 times slower on development). See below the indexing stats:

Total docsBytesTime (sec)Total docsBytesTime (sec)
incident_index_1_core7331653112239.436incident_index_6_core7331282395938.802
incident_index_1_delta611280.184incident_index_6_delta6247634255.234
incident_index_2_core7319675118945.477incident_index_7_core7319283317268.819
incident_index_2_delta58430.233incident_index_7_delta5247632895.321
incident_index_3_core7390680381442.064incident_index_8_core7390283101217.913
incident_index_3_delta821430.203incident_index_8_delta8247643665.282
incident_index_4_core7278637766437.665incident_index_9_core7278281622607.891
incident_index_4_delta611080.436incident_index_9_delta6247633305.456
incident_index_5_core7396660135839.704incident_index_10_core7396281520759.562
incident_index_5_delta69440.216incident_index_10_delta6247633085.303

Any idea why this is happening?

Thanks,
Jonathan

Pat Allan

unread,
Jun 28, 2015, 10:17:37 PM6/28/15
to thinkin...@googlegroups.com
Hi Jonathan

Can you share your index definitions so I can get a better idea of where the problem might be?

Also: which versions of Rails and Thinking Sphinx are you using?

— 
Pat

For more options, visit https://groups.google.com/d/optout.

jons...@gmail.com

unread,
Jun 29, 2015, 1:23:46 AM6/29/15
to thinkin...@googlegroups.com
Certainly. I shoulda attached them before.

To unsubscribe from this group and stop receiving emails from it, send an email to thinking-sphinx+unsub...@googlegroups.com.

jons...@gmail.com

unread,
Jun 29, 2015, 2:03:04 AM6/29/15
to thinkin...@googlegroups.com
Rails version: 4.1.7
TS version: 3.0.6


On Monday, June 29, 2015 at 5:17:37 AM UTC+3, Pat Allan wrote:
To unsubscribe from this group and stop receiving emails from it, send an email to thinking-sphinx+unsub...@googlegroups.com.

jons...@gmail.com

unread,
Jun 29, 2015, 6:52:57 AM6/29/15
to thinkin...@googlegroups.com
I even less understand the number of bytes in delta indexes 6 - 10. Why does 1_delta contain 1128 bytes and 6_delta 24M? They're on the same records.

Pat Allan

unread,
Jun 29, 2015, 10:12:27 AM6/29/15
to thinkin...@googlegroups.com
I’m not sure why the sizes are so different, but I think the overall issue is related to the three attributes that have :source => :query.

I’d recommend making two changes to each of them:

* Add a condition to each query that filters by the appropriate incident ids (like you’re doing for the main query) so the results are sharded in the same way.
* Perhaps add a second SQL statement to each of those attributes (separated by a semi-colon), with :source set to :ranged_query, as covered in the Sphinx documentation:

The first of those isn’t too complex, so I’d start with that. Certainly the second is far more fiddly, but may be worthwhile.

Hope this helps!

— 
Pat

For more options, visit https://groups.google.com/d/optout.

jons...@gmail.com

unread,
Jun 30, 2015, 2:45:17 AM6/30/15
to thinkin...@googlegroups.com
Thanks, sharding the joined queries works. I'd also like to improve them for the deltas. Is there any way to add "WHERE delta = 1" to the joined queries in the delta definition?

Pat Allan

unread,
Jun 30, 2015, 3:00:13 AM6/30/15
to thinkin...@googlegroups.com
You’d have to end up with a fair bit of duplication, but it’s technically possible.

# creates both core and delta indices
ThinkingSphinx::Index.define(:article,
:with => :active_record,
:delta => ThinkingSphinx::Deltas::ResqueDelta
) do
# …
end

Is the equivalent of:

# create core index
ThinkingSphinx::Index.define(:article,
:with => :active_record,
:delta? => false,
:delta_processor => ThinkingSphinx::Deltas.processor_for(ThinkingSphinx::Deltas::ResqueDelta)
) do
# …
end

# create delta index
ThinkingSphinx::Index.define(:article,
:with => :active_record,
:delta? => true,
:delta_processor => ThinkingSphinx::Deltas.processor_for(ThinkingSphinx::Deltas::ResqueDelta)
) do
# …
end

The first being the core index, the second being the delta, with the same definition block normally being applied to both. If you want to have something slightly different in the delta index definition block, I guess you could try something along these lines?


Pat

jons...@gmail.com

unread,
Jul 8, 2015, 2:05:46 AM7/8/15
to thinkin...@googlegroups.com
Awesome, works like a charm! See the SO question that refers to this thread: http://stackoverflow.com/questions/30913789/thinking-sphinx-indexing-performance.
Reply all
Reply to author
Forward
0 new messages