Sphinx collected docs... but not enough!

55 views
Skip to first unread message

Steve H

unread,
Dec 13, 2009, 9:44:50 PM12/13/09
to Thinking Sphinx
OK first off -- I'm not 100% certain that 1 doc = 1 record. But if it
DOES then, sphinx definitely didn't collect enough. We have over 13
million rows in our database, but this was the output after indexing:

-----------------------------------------------
indexing index 'business_core'...
collected 1199 docs, 0.7 MB
collected 0 attr values
sorted 0.0 Mvalues, 100.0% done
sorted 0.1 Mhits, 100.0% done
total 1199 docs, 685923 bytes
total 1381.345 sec, 496.56 bytes/sec, 0.87 docs/sec
distributed index 'business' can not be directly indexed; skipping.
Started successfully (pid 32011).
-----------------------------------------------

After reading some of the docs, I thought... maybe I have to increase
the sql_range_step in the sphinx.yml or increase the
group_concat_max_len property. So I did those things, but the output
after indexing is the same (albeit much much faster :)

-----------------------------------------------
Stopped search daemon (pid 2810).
Generating Configuration to /home/steve/insiderpages/config/
sphinx_dev.sphinx.conf
Sphinx 0.9.8.1-release (r1533)
Copyright (c) 2001-2008, Andrew Aksyonoff

using config file '/home/steve/insiderpages/config/
sphinx_dev.sphinx.conf'...
indexing index 'business_core'...
collected 1199 docs, 0.1 MB
collected 0 attr values
sorted 0.0 Mvalues, 100.0% done
sorted 0.0 Mhits, 100.0% done
total 1199 docs, 79605 bytes
total 0.447 sec, 177901.38 bytes/sec, 2679.53 docs/sec
distributed index 'business' can not be directly indexed; skipping.
Started successfully (pid 6365).
-----------------------------------------------


What's the problem?

-Steve

PS:
mysql> select count(id) from businesses;
+-----------+
| count(id) |
+-----------+
| 13395683 |
+-----------+
1 row in set (0.00 sec)

mysql> select max(id) from businesses;
+-------------+
| max(id) |
+-------------+
| 15250646126 |
+-------------+
1 row in set (0.00 sec)

Pat Allan

unread,
Dec 13, 2009, 10:03:42 PM12/13/09
to thinkin...@googlegroups.com
Hi Steve

What does your define_index block for Business look like?

--
Pat
> --
>
> You received this message because you are subscribed to the Google Groups "Thinking Sphinx" group.
> To post to this group, send email to thinkin...@googlegroups.com.
> To unsubscribe from this group, send email to thinking-sphi...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/thinking-sphinx?hl=en.
>
>

Steve H

unread,
Dec 14, 2009, 11:26:10 AM12/14/09
to Thinking Sphinx
Thanks for the lightning-fast reply Pat. :)

define_index do
indexes :name, :sortable => true
indexes categories_and_tags(:name), :as
=> :category_names, :sortable => true
indexes [city.city, city.state_code], :as => :city, :sortable =>
true
set_property :group_concat_max_len => 8192
# attributes
has lat, lon
end

Every business has at least one category or tag, but I'm kind of
assuming here that the indexer is doing a full left join (ie, not an
inner join).

Could something in this block be causing the problem? And is 1 doc =
1 row in the db?

-Steve

Pat Allan

unread,
Dec 14, 2009, 5:58:41 PM12/14/09
to thinkin...@googlegroups.com
Hi Steve

Thinking Sphinx uses left outer joins (because that's what ActiveRecord uses, and it's the best fit in this situation), and yes, one doc is one row.

There's nothing odd in your index definition - so let's have a look at the sql_query value of the source business_core_0, in config/development.sphinx.conf (or production - they should be the same). You should be able to run it manually, see if it returns the right results (perhaps over a particular step, to keep things reasonably fast and manageable).

--
Pat

Steve H

unread,
Dec 14, 2009, 7:24:04 PM12/14/09
to Thinking Sphinx
Hi Pat,

Here is the sql_query:

sql_query = SELECT SQL_NO_CACHE `businesses`.`id` * 1 + 0 AS `id` ,
CAST(`businesses`.`name` AS CHAR) AS `name`, CAST(GROUP_CONCAT
(DISTINCT IFNULL(`cs_categories`.`name`, '0') SEPARATOR ' ') AS CHAR)
AS `category_names`, CAST(CONCAT_WS(' ', `city`.`city`,
`city`.`state_code`) AS CHAR) AS `city`, `businesses`.`id` AS
`sphinx_internal_id`, 4054125678 AS `class_crc`, '4054125678' AS
`subclass_crcs`, 0 AS `sphinx_deleted`, IFNULL(`businesses`.`name`,
'') AS `name_sort`, GROUP_CONCAT(DISTINCT IFNULL(IFNULL
(`cs_categories`.`name`, ''), '0') SEPARATOR ' ') AS
`category_names_sort`, CONCAT_WS(' ', IFNULL(`city`.`city`, ''), IFNULL
(`city`.`state_code`, '')) AS `city_sort`, `businesses`.`lat` AS
`lat`, `businesses`.`lon` AS `lon` FROM `businesses` LEFT OUTER
JOIN `categorizations` ON (`businesses`.`id` =
`categorizations`.`business_id`) LEFT OUTER JOIN `cs_categories` ON
(`cs_categories`.`id` = `categorizations`.`cs_category_id`) LEFT
OUTER JOIN `city` ON `city`.city_id = `businesses`.city_id WHERE
`businesses`.`id` >= $start AND `businesses`.`id` <= $end GROUP BY
`businesses`.`id` ORDER BY NULL
sql_query_range = SELECT IFNULL(MIN(`id`), 1), IFNULL(MAX(`id`), 1)
FROM `businesses`

We turned on logging for every query in mysqld to troubleshoot this.
This is the last query run:

SELECT SQL_NO_CACHE `businesses`.`id` * 1 + 0 AS `id` , CAST
(`businesses`.`name` AS CHAR) AS `name`, CAST(GROUP
_CONCAT(DISTINCT IFNULL(`cs_categories`.`name`, '0') SEPARATOR ' ') AS
CHAR) AS `category_names`, CAST(CONCAT_WS(' ', `city`.`city`,
`city`.`state_code`) AS CHAR) AS `city`, `businesses`.`id` AS
`sphinx_internal_id`, 4054125678 AS `class_crc`, '4054125678' AS
`subclass_crcs`, 0 AS `sphinx_deleted`, IFNULL(`businesses`.`name`,
'') AS `name_sort`, GROUP_CONCAT(DISTINCT IFNULL(IFNULL
(`cs_categories`.`name`, ''), '0') SEPARATOR ' ') AS
`category_names_sort`, CONCAT_WS(' ', IFNULL(`city`.`city`, ''), IFNULL
(`city`.`state_code`, '')) AS `city_sort`, `businesses`.`lat` AS
`lat`, `businesses`.`lon` AS `lon` FROM `businesses` LEFT OUTER
JOIN `categorizations` ON (`businesses`.`id` =
`categorizations`.`business_id`) LEFT OUTER JOIN `cs_categories` ON
(`cs_categories`.`id` = `categorizations`.`cs_category_id`) LEFT
OUTER JOIN `city` ON `city`.city_id = `businesses`.city_id WHERE
`businesses`.`id` >= 2365000001 AND `businesses`.`id` <= 2365744349
GROUP BY `businesses`.`id` ORDER BY NULL

Which returns zero rows. In fact, the last 15+ queries return zero
rows. The businesses table has no records with ids
2034010806...2500480805. And of course, the maximum id is
15250646237. We need *all* the rows indexed, of course... :)

So should we be increasing the step_size to like 10 million?

-Steve

Pat Allan

unread,
Dec 14, 2009, 7:32:42 PM12/14/09
to thinkin...@googlegroups.com
Hmm, if you run the sql_query_range query yourself, what does it return?

--
Pat

Steve H

unread,
Dec 14, 2009, 7:35:30 PM12/14/09
to Thinking Sphinx
mysql> SELECT IFNULL(MIN(`id`), 1), IFNULL(MAX(`id`), 1)
-> FROM `businesses`;
+----------------------+----------------------+
| IFNULL(MIN(`id`), 1) | IFNULL(MAX(`id`), 1) |
+----------------------+----------------------+
| 1 | 15250646237 |
+----------------------+----------------------+

Steve H

unread,
Dec 14, 2009, 7:40:52 PM12/14/09
to Thinking Sphinx
I found one thing very strange... no matter what the sql_range_step is
set to (default - 10,000,000), the last query always has an upper
limit of 2365744349. My very astute manager noticed that this is kind
of in the neighborhood of the largest possible value for an int.
Maybe that has something to do with it?

-Steve

Pat Allan

unread,
Dec 14, 2009, 7:53:21 PM12/14/09
to thinkin...@googlegroups.com
Yeah, that's my thought too - was wondering if it was MySQL at fault, or Sphinx.

Sphinx 0.9.9 (and maybe 0.9.8) has a --enable-id64 flag when you configure - I think it's worth recompiling and trying indexing again.

--
Pat

Steve H

unread,
Dec 14, 2009, 8:30:00 PM12/14/09
to Thinking Sphinx
Thanks Pat!! That certainly did the trick. ~13.5 million businesses
with their associated city and categories indexed in about 15
minutes. Of course the indexing was being done on the same box that
the mysqld was being run on, but still -- pretty dang fast.

Thanks again! Now for the fun stuff: actual implementation,
deployment concerns, etc.

-Steve
> ...
>
> read more »

Steve H

unread,
Jan 4, 2010, 8:00:21 PM1/4/10
to Thinking Sphinx
Pat --

For you and anyone else who may run into this issue:
We found that while the indexer successfully indexed all our models,
the sphinx_internal_id was wrong for all models with large ids
(anything 11 digits or more). This, of course, caused results brought
back to be not found in the database. We were wondering why the
"document" number returned by the search binary couldn't be used
instead of sphinx_internal_id... but that aside, we found that
sphinx_attr_bigint was introduced with 0.9.9, so we will be (upgrading
and) modifying our generated config file appropriately.

Is there a way to specify that the sphinx_internal_id should be a
sphinx_attr_bigint in the sphinx.yml file?

Thanks,
Steve

> ...
>
> read more »

Pat Allan

unread,
Jan 5, 2010, 3:36:02 AM1/5/10
to thinkin...@googlegroups.com
Hi Steve

I've been thinking about this a bit today... it might be possible to default to bigint, although that has performance issues I'm guessing (granted, I doubt they're big ones). Maybe indexes need some flag to default all integers to big ints...

Consider it on my list of things to do. If you'd like to create a ticket for this on GitHub, that'd be fantastic.

Cheers

--
Pat

Steve H

unread,
Jan 5, 2010, 1:38:51 PM1/5/10
to Thinking Sphinx
Consider it done. Ticket created. :)

Cheers!

-Steve

> ...
>
> read more »

Reply all
Reply to author
Forward
0 new messages