Real Time Indexing Questions

Garrett Dimon

unread,

Feb 27, 2014, 5:47:56 PM2/27/14

to thinkin...@googlegroups.com, Wade Winningham

Howdy, Pat. We're in the process of upgrading from TS 2 with delayed deltas to TS 3 with real time indexing.

We've been able to get everything up and running, but we've run into a couple of problems/questions around the indexing. These may ultimately be Sphinx questions rather than Thinking Sphinx questions, but I thought I'd start here since we're only changing Sphinx from 2.0 to 2.1.

1. Index Creation Performance

Our production logs show about a 20 minute turnaround to do a complete reindex of our production data with TS 2. Running some local tests, TS 3 generate is taking at least an hour for that data. (The generate is crashing, so it may ultimately take even longer.) Our indexing configuration is setup so that a large portion of content is excluded from the index. (Inactive accounts, archived projects, etc.) We've verified that searching is correctly excluding the relevant records, but appears as if that's happening when the query is run rather than when the indexing occurs. Our only theory so far is that with TS 2 and traditional indexing, those weren't included in the index at all, but that with real time indexing, they're included in the index and filtered out when the query is run. Can you provide any insight about whether this sounds like normal behavior or whether we've likely screwed something up? :)

2. The Generate is crashing with "rake aborted! sphinxql: syntax error, unexpected $undefined, expecting CONST_INT (or 4 other tokens) near ''..." (where ... is content from our DB.)

I've done some searching, but haven't had any luck. I've run the generate rake task on two separate occasions, and both times it failed with the same error message and content, so my gut is leading me to think that it's an encoding or unescaped quotation mark problem. Does that problem ring any bells?

Thanks!

Pat Allan

unread,

Feb 27, 2014, 10:21:41 PM2/27/14

to thinkin...@googlegroups.com

Hi Garrett

Generation can be slow - at the end of the day, it really comes down to how much data you’re dealing with, and if you’re using aggregation methods, how quick they are. It’s all going through your Rails app (instead of just SQL queries), so optimising for that is different to adding db indices and such.

As for the error though… without having a copy of the database, it’s a little hard to debug, but it sounds like there’s a bug in TS with something in the data being passed through. Having a look at your app log may help identify the record in question… also, what does the index definition for that model look like?

—

Pat

--
You received this message because you are subscribed to the Google Groups "Thinking Sphinx" group.
To unsubscribe from this group and stop receiving emails from it, send an email to thinking-sphi...@googlegroups.com.
To post to this group, send email to thinkin...@googlegroups.com.
Visit this group at http://groups.google.com/group/thinking-sphinx.
For more options, visit https://groups.google.com/groups/opt_out.

Garrett Dimon

unread,

Feb 28, 2014, 10:09:37 AM2/28/14

to thinkin...@googlegroups.com

Roger on the generation. Any high-level suggestions for optimization?

I'll see if I can't figure out the exact record that's tripping up the index generation. In the meantime, here's a gist of our index definition:

https://gist.github.com/garrettdimon/25f6c305541f30b3ce39

--
You received this message because you are subscribed to a topic in the Google Groups "Thinking Sphinx" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/thinking-sphinx/7llAB4zO4bw/unsubscribe.
To unsubscribe from this group and all its topics, send an email to thinking-sphi...@googlegroups.com.

Pat Allan

unread,

Feb 28, 2014, 10:21:07 AM2/28/14

to thinkin...@googlegroups.com

The `where` method doesn’t apply for real-time indices - but try this instead in the index definition block:

scope { Issue.where "account_status_id IN (1,2) AND archived IS false” }

That should ensure only the appropriate issues are indexed as part of the generate call.

Beyond that, you may wish to add `includes` within that scope to cover all associations used within the index?

Unrelated to any of this: the match_mode defaults to extended with TS v3 (indeed, it can’t be anything else).

—
Pat

Garrett Dimon

unread,

Feb 28, 2014, 10:46:44 AM2/28/14

to thinkin...@googlegroups.com

Thanks! We'll give that a shot. It seems the where *is* being applied at search time, but just not index-time. We're actually considering indexing everything with RT because previously when projects were unarchived or accounts unfrozen, they'd get picked backup with the full reindexes, but it seems with RT, that wouldn't happen because we don't run full re-indexes. Is there a way that we could tell TS to reindex in those situations if we use this approach and exclude them from the index?

I'm trying to track down the specific content that's tripping up the generation, but I'm not having much luck. I found the record that's printed out when rake files, and I manually inspected the records in nearby proximity to it, but they all check out. Is there a way/location for me to see the SQL query that's being used at the moment the rake task fails?

On Fri, Feb 28, 2014 at 9:21 AM, Pat Allan <p...@freelancing-gods.com> wrote:

The `where` method doesn't apply for real-time indices - but try this instead in the index definition block:

scope { Issue.where "account_status_id IN (1,2) AND archived IS false" }

That should ensure only the appropriate issues are indexed as part of the generate call.

Beyond that, you may wish to add `includes` within that scope to cover all associations used within the index?

Unrelated to any of this: the match_mode defaults to extended with TS v3 (indeed, it can't be anything else).

--

Pat

On 1 Mar 2014, at 2:09 am, Garrett Dimon <gar...@nextupdate.com> wrote:

> Roger on the generation. Any high-level suggestions for optimization?
>
> I'll see if I can't figure out the exact record that's tripping up the index generation. In the meantime, here's a gist of our index definition:
> https://gist.github.com/garrettdimon/25f6c305541f30b3ce39
>
>
> On Thu, Feb 27, 2014 at 9:21 PM, Pat Allan <p...@freelancing-gods.com> wrote:
> Hi Garrett
>
> Generation can be slow - at the end of the day, it really comes down to how much data you're dealing with, and if you're using aggregation methods, how quick they are. It's all going through your Rails app (instead of just SQL queries), so optimising for that is different to adding db indices and such.
>

> As for the error though... without having a copy of the database, it's a little hard to debug, but it sounds like there's a bug in TS with something in the data being passed through. Having a look at your app log may help identify the record in question... also, what does the index definition for that model look like?
>
> --

Garrett Dimon

unread,

Feb 28, 2014, 10:50:36 AM2/28/14

to thinkin...@googlegroups.com

One more question about adding "includes" to the index. If we're currently adding those includes when the query is run, is it accomplishing the same thing, or would adding them here potentially help the indexing speed as well?

When we search, the includes are conditional based on a variety of other factors, whereas here, it seems they'd be included every time even they weren't needed.

On Fri, Feb 28, 2014 at 9:21 AM, Pat Allan <p...@freelancing-gods.com> wrote:

The `where` method doesn't apply for real-time indices - but try this instead in the index definition block:

scope { Issue.where "account_status_id IN (1,2) AND archived IS false" }

That should ensure only the appropriate issues are indexed as part of the generate call.

Beyond that, you may wish to add `includes` within that scope to cover all associations used within the index?

Unrelated to any of this: the match_mode defaults to extended with TS v3 (indeed, it can't be anything else).

--

Pat

On 1 Mar 2014, at 2:09 am, Garrett Dimon <gar...@nextupdate.com> wrote:

> Roger on the generation. Any high-level suggestions for optimization?
>
> I'll see if I can't figure out the exact record that's tripping up the index generation. In the meantime, here's a gist of our index definition:
> https://gist.github.com/garrettdimon/25f6c305541f30b3ce39
>
>
> On Thu, Feb 27, 2014 at 9:21 PM, Pat Allan <p...@freelancing-gods.com> wrote:
> Hi Garrett
>
> Generation can be slow - at the end of the day, it really comes down to how much data you're dealing with, and if you're using aggregation methods, how quick they are. It's all going through your Rails app (instead of just SQL queries), so optimising for that is different to adding db indices and such.
>

> As for the error though... without having a copy of the database, it's a little hard to debug, but it sounds like there's a bug in TS with something in the data being passed through. Having a look at your app log may help identify the record in question... also, what does the index definition for that model look like?
>
> --

Pat Allan

unread,

Feb 28, 2014, 8:19:34 PM2/28/14

to thinkin...@googlegroups.com

From reading through the source a bit more, it seems the `where` conditions for a real-time index can be set to either a symbol (representing an instance method name within the indexed model) or a proc (called with a single argument, the instance of the model), and if any condition exists, each much return true for the object to be added to the real-time index.

As for monitoring SphinxQL statements that are being sent to Sphinx, there’s been a recent commit to TS that adds real-time updates to the logs. Adding the following to your Gemfile should sort that out:

gem ‘thinking-sphinx’, ‘~> 3.1.0’,
:git => ‘git://github.com/pat/thinking-sphinx.git',
:branch => ‘develop’,
:ref => '94ee176a7a’

Also: anything within the scope block (includes or otherwise) are only applied for the full index generation. They’re not used when updating the Sphinx indices for a single record is created/edited, nor do they apply during searches.

Garrett Dimon

unread,

Mar 1, 2014, 8:13:05 AM3/1/14

to thinkin...@googlegroups.com, Wade Winningham

I believe that I may have found the culprit. (The "\u0000" unicode null character.) By updating the scope of the index, I was able to narrow the failure down to the two offending issues and their comments and attempting to index just those issues. That made the failures quicker. Otherwise, I was having to wait 30+ minutes for the indexer to fail. That made it difficult to test out theories. It's obvious in hindsight, but it really helped get to the bottom of things.

We concatenate all of the comments using a "comment_bodies" method on the issue for indexing purposes. In both cases, one of the comments for the issue contains "\u0000". I had previously been using Sequel Pro to examine the contents of the comments, so I didn't see the character. Once I was able to definitively identify the offending comments, I checked the value of "comment_bodies" in console, and the \u0000 was visible.

I added a low-tech ".gsub(/\u0000/, '')" to our comment_bodies, and all was right again in indexing land. So, it appears that the \u0000 null unicode sequence is definitely the culprit. For now, we'll filter it out manually, but I expect that's something that Thinking Sphinx could handle.

I hope this helps!

Pat Allan

unread,

Mar 1, 2014, 8:52:29 AM3/1/14

to thinkin...@googlegroups.com, Wade Winningham

Thanks Garrett!

Just pushed a fix to the develop branch for this - if you want to update the ref in your Gemfile, it’s 19ea71dd90. Took the same approach as your fix, though I wish there was something more elegant (or Sphinx just handled it gracefully).

Reply all

Reply to author

Forward