Slow indexing (never finishes) when indexing 4 or more associations

Steve Kenworthy

unread,

Jul 25, 2013, 12:54:08 AM7/25/13

to thinkin...@googlegroups.com

Hi there,

Firstly, thinking-sphinx is awesome and I love it. Thanks Pat for an excellent project. V3 is looking great and represents a lot of hard work and effort.

I've been using thinking-sphinx to index a document model and it's really slowed down when I add lots of associations in the index. In fact, it never finishes on my machine (8Gig RAM, 8 CPU's) when I add 4 indexes.

Times:

4 seconds - when 1 association (images) is indexed
6 seconds - when 2 associations (images and subscribers) are indexed
23 seconds - when 2 associations (images and countries) are indexed
115 seconds - when 3 associations (images, subscribers and tags) are indexed
113 seconds - when 3 associations (images, subscribers and videos) are indexed (just to prove it's not tags slowing it down)
ꝏ (not finishing) - when 4 associations or more are selected.

Here's my index file:

ThinkingSphinx::Index.define :document, with: :active_record, delta: true, sql_range_step: 999999999, group_concat_max_len: 16384 do

has countries(:id), as: :country_ids

has images(:id), as: :image_ids, facet: true

has subscribers(:id), as: :subscriber_ids, facet: true

has tags(:id), as: :tag_ids, facet: true

has videos(:id), as: :video_ids, facet: true

indexes countries.name, as: :countries

indexes images.title, as: :images

indexes subscribers.title, as: :subscribers

indexes tags.name, as: :tags

indexes videos.title, as: :videos

has updated_at

end

The generated sql is a massive group_by query and is not finishing. See it here https://github.com/crossroads/rails3-ts-example#what-sphinx-is-doing

I'd really appreciate some advice on how to optimise this so indexing becomes viable again. Do I just have too much going on here? I'm using facets, indexes and attributes. Perhaps there is a better way to optimise? A friend suggested pre-computing with some joins... how would this work?

Vital stats: using mysql v14.14, sphinx 2.0.4, Ubuntu, rails 3.2.13, thinking-sphinx 3.0.4

For those who'd like to take a look, I've uploaded a sample project here https://github.com/crossroads/rails3-ts-example which can be cloned. If you follow the instructions, it will setup a db with test data and reproduce the problem quickly.

There's also the sphinx generated SQL and EXPLAIN: https://github.com/crossroads/rails3-ts-example#what-sphinx-is-doing

Thanks in advance for anyone taking the time to read.

Regards,

Steve

Pat Allan

unread,

Jul 25, 2013, 10:14:01 AM7/25/13

to thinkin...@googlegroups.com

Hi Steve

I've got a way forward to greatly improve the speed of indexing… unfortunately, it's not going to work within Thinking Sphinx easily right now.

Sphinx has the ability to gather attribute and field values from separate queries - this existed for TS v1/v2 for attributes, and fields was added in TS v3, but the catch is those separate queries don't work for HABTM joins. I'd love to change that, it's just painful from an ActiveRecord perspective because you're not dealing with a model's table as the base, but the HABTM join table.

Here's the configuration for the relevant source that I modified by hand:
https://gist.github.com/pat/6080031

You'll see that the main query is nice and short - and then there's each of the MVA and joined field definitions. If you put this in the generated source definition in config/development.sphinx.conf, and then run the indexer manually (NOT through the rake task, that'll overwrite this):
indexer --config config/development.sphinx.conf --all --rotate

(Remove --rotate if Sphinx isn't running.) You'll see it's pretty damn fast.

Now, ways forward? Well, I'd love to write something for TS v3 that can handle HABTM - it's just a shame that it might need to be pure ARel rather than ActiveRecord-built (which can otherwise help with joins).

But otherwise: switch from HABTM to has_many/has_many :through - make each of the joins an actual model. Then, you can add :source => :query to each of the appropriate field and attribute definitions, and it should generate something pretty much the same.

Hope this provides some clarity at the very least! And also: thanks for the test app, really helped with debugging!

--
Pat

> --
> You received this message because you are subscribed to the Google Groups "Thinking Sphinx" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to thinking-sphi...@googlegroups.com.
> To post to this group, send email to thinkin...@googlegroups.com.
> Visit this group at http://groups.google.com/group/thinking-sphinx.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>

Pat Allan

unread,

Jul 26, 2013, 8:57:38 AM7/26/13

to thinkin...@googlegroups.com

Heya Steve

Was just looking into how difficult this would be to implement properly, and noticed I have added the ability to take a string as the source query - instead of the column references. So, it's possible without hacking around in the index definition itself:

https://gist.github.com/pat/6088629

It's worth noting that the document id (Sphinx's equivalent of a primary key) involves the normal primary key with an offset and a multiplier. Make sure those two integers match what's in your generated index in sql_query. They may change when you add other indices to your app (depends on alphabetical order of your index files).

Also: there's probably some metaprogramming you could add to simplify things a bit more.

Would love to hear if this approach helps with your real app and not just the test one :)

--
Pat

Steve Kenworthy

unread,

Jul 26, 2013, 1:03:00 PM7/26/13

to thinkin...@googlegroups.com

Thanks Pat,

Two very useful and insightful replies. I sat down to analyse what you wrote just after you sent the second post and it's definitely worked very well so far. Indexing is much faster with the string source query, and I've worked out the correct multipliers (thanks for the tip).

I'll finish off the indexes and test the app to confirm but so far so good.

Thanks for that!

Regards,

Steve

jons...@gmail.com

unread,

Jun 28, 2015, 9:50:18 AM6/28/15

to thinkin...@googlegroups.com

Hi Pat,

I implemented according to this, and the indexing time went down (5 times faster on development). However, the delta indexing time went up (30 times slower on development). See below the indexing stats:

	Total docs	Bytes	Time (sec)		Total docs	Bytes	Time (sec)
incident_index_1_core	7331	6531122	39.436	incident_index_6_core	7331	28239593	8.802
incident_index_1_delta	6	1128	0.184	incident_index_6_delta	6	24763425	5.234
incident_index_2_core	7319	6751189	45.477	incident_index_7_core	7319	28331726	8.819
incident_index_2_delta	5	843	0.233	incident_index_7_delta	5	24763289	5.321
incident_index_3_core	7390	6803814	42.064	incident_index_8_core	7390	28310121	7.913
incident_index_3_delta	8	2143	0.203	incident_index_8_delta	8	24764366	5.282
incident_index_4_core	7278	6377664	37.665	incident_index_9_core	7278	28162260	7.891
incident_index_4_delta	6	1108	0.436	incident_index_9_delta	6	24763330	5.456
incident_index_5_core	7396	6601358	39.704	incident_index_10_core	7396	28152075	9.562
incident_index_5_delta	6	944	0.216	incident_index_10_delta	6	24763308	5.303

Any idea why this is happening?

Thanks,

Jonathan

Pat Allan

unread,

Jun 28, 2015, 10:17:37 PM6/28/15

to thinkin...@googlegroups.com

Hi Jonathan

Can you share your index definitions so I can get a better idea of where the problem might be?

Also: which versions of Rails and Thinking Sphinx are you using?

—

Pat

For more options, visit https://groups.google.com/d/optout.

jons...@gmail.com

unread,

Jun 29, 2015, 1:23:46 AM6/29/15

to thinkin...@googlegroups.com

Certainly. I shoulda attached them before.

https://gist.github.com/jonsgold/a3c41dbc9af44295c91f

To unsubscribe from this group and stop receiving emails from it, send an email to thinking-sphinx+unsub...@googlegroups.com.

jons...@gmail.com

unread,

Jun 29, 2015, 2:03:04 AM6/29/15

to thinkin...@googlegroups.com

Rails version: 4.1.7

TS version: 3.0.6

On Monday, June 29, 2015 at 5:17:37 AM UTC+3, Pat Allan wrote:

To unsubscribe from this group and stop receiving emails from it, send an email to thinking-sphinx+unsub...@googlegroups.com.

jons...@gmail.com

unread,

Jun 29, 2015, 6:52:57 AM6/29/15

to thinkin...@googlegroups.com

I even less understand the number of bytes in delta indexes 6 - 10. Why does 1_delta contain 1128 bytes and 6_delta 24M? They're on the same records.

Pat Allan

unread,

Jun 29, 2015, 10:12:27 AM6/29/15

to thinkin...@googlegroups.com

I’m not sure why the sizes are so different, but I think the overall issue is related to the three attributes that have :source => :query.

I’d recommend making two changes to each of them:

* Add a condition to each query that filters by the appropriate incident ids (like you’re doing for the main query) so the results are sharded in the same way.

* Perhaps add a second SQL statement to each of those attributes (separated by a semi-colon), with :source set to :ranged_query, as covered in the Sphinx documentation:

http://sphinxsearch.com/docs/current.html#conf-sql-attr-multi

The first of those isn’t too complex, so I’d start with that. Certainly the second is far more fiddly, but may be worthwhile.

Hope this helps!

—

Pat

For more options, visit https://groups.google.com/d/optout.

jons...@gmail.com

unread,

Jun 30, 2015, 2:45:17 AM6/30/15

to thinkin...@googlegroups.com

Thanks, sharding the joined queries works. I'd also like to improve them for the deltas. Is there any way to add "WHERE delta = 1" to the joined queries in the delta definition?

Pat Allan

unread,

Jun 30, 2015, 3:00:13 AM6/30/15

to thinkin...@googlegroups.com

You’d have to end up with a fair bit of duplication, but it’s technically possible.

# creates both core and delta indices
ThinkingSphinx::Index.define(:article,
:with => :active_record,
:delta => ThinkingSphinx::Deltas::ResqueDelta
) do
# …
end

Is the equivalent of:

# create core index
ThinkingSphinx::Index.define(:article,
:with => :active_record,
:delta? => false,
:delta_processor => ThinkingSphinx::Deltas.processor_for(ThinkingSphinx::Deltas::ResqueDelta)
) do
# …
end

# create delta index
ThinkingSphinx::Index.define(:article,
:with => :active_record,
:delta? => true,
:delta_processor => ThinkingSphinx::Deltas.processor_for(ThinkingSphinx::Deltas::ResqueDelta)
) do
# …
end

The first being the core index, the second being the delta, with the same definition block normally being applied to both. If you want to have something slightly different in the delta index definition block, I guess you could try something along these lines?

—
Pat

jons...@gmail.com

unread,

Jul 8, 2015, 2:05:46 AM7/8/15

to thinkin...@googlegroups.com

Awesome, works like a charm! See the SO question that refers to this thread: http://stackoverflow.com/questions/30913789/thinking-sphinx-indexing-performance.

Reply all

Reply to author

Forward