Distributed indexing for large datasets and multi-core boxes

mixonic

Nov 19, 2009, 5:44:38 PM

to Thinking Sphinx

Hi Pat, friendly folk,

I've got a 375,000-row geo data set indexed by lat and lon. I can
search just great, but queries usually take about 650ms on our
production hardware. I found a huge speed boost from having the main
distributed index search a split set of sources and indexes. So instead of:

index datapoint
{
type = distributed
local = datapoint_core
}

I have:

index datapoint
{
type = distributed
local = datapoint_core_0
local = datapoint_core_1
local = datapoint_core_2
local = datapoint_core_3
}

Each source then has a range by IDs:

source datapoint_core_0
{
# (other source settings omitted)
sql_query_range = SELECT IFNULL(MIN(`id`), 1), IFNULL(MAX(`id`), 1) \
FROM `restaurant_inspection_datapoints` WHERE `id` BETWEEN 0 AND 93878
}

Where 93878 is 1/4 of the records. Each index covers 1/4 of the total
IDs. On my laptop alone, this gave me massive gains: queries that
could take a second took 300ms.
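
For anyone who wants to script the split rather than edit the config by hand, here is a rough Ruby sketch of the idea. It is not riddle or thinking_sphinx code: the helper names (shard_ranges, range_query, distributed_index) are made up for illustration, and it assumes IDs are spread fairly evenly so that equal ID ranges hold roughly equal numbers of rows.

```ruby
# Hypothetical helpers, not part of riddle or thinking_sphinx.

# Split the inclusive ID span min_id..max_id into `shards` contiguous,
# non-overlapping [first, last] pairs. SQL BETWEEN is inclusive on both
# ends, so adjacent ranges must not share a boundary ID.
def shard_ranges(min_id, max_id, shards)
  base, extra = (max_id - min_id + 1).divmod(shards)
  first = min_id
  (0...shards).map do |i|
    size = base + (i < extra ? 1 : 0) # spread any remainder over the first shards
    range = [first, first + size - 1]
    first += size
    range
  end
end

# Build the sql_query_range value for one shard of `table`.
def range_query(table, first, last)
  "SELECT IFNULL(MIN(`id`), 1), IFNULL(MAX(`id`), 1) " \
    "FROM `#{table}` WHERE `id` BETWEEN #{first} AND #{last}"
end

# Build the distributed index block that lists every local shard index.
def distributed_index(name, shards)
  locals = (0...shards).map { |i| "local = #{name}_core_#{i}" }
  (["index #{name}", "{", "type = distributed"] + locals + ["}"]).join("\n")
end

# Example: four shards over an assumed ID span of 0..375_511.
shard_ranges(0, 375_511, 4).each_with_index do |(first, last), i|
  puts "source datapoint_core_#{i}"
  puts "sql_query_range = #{range_query('restaurant_inspection_datapoints', first, last)}"
end
puts distributed_index("datapoint", 4)
```

This only prints the lines that differ per shard; each source and each datapoint_core_N index would still need the rest of its usual settings. If the IDs have large gaps, splitting by row count instead of by ID span would probably balance the shards better.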

I would love to get this into riddle or thinking_sphinx, but the
riddle configuration code is really complex. Is anyone interested in
working with me on this or giving me a starting point?

Pat, does that use of "local" make sense? All the other examples out
there use agents.

See:

http://www.sphinxsearch.com/docs/current.html#distributed
http://www.sphinxsearch.com/bugs/view.php?id=407 <-- I didn't
encounter that bug
http://blog.wasimasif.com/sphinx-distributed-searching/

Thanks all, I'm very excited to get this up!

--
Matthew Beale :: 607 227 0871
Resume & Portfolio @ http://madhatted.com

Pat Allan

Nov 25, 2009, 12:08:25 AM

to thinkin...@googlegroups.com

Hi Matthew

That's a pretty neat setup - didn't realise splitting things out
provides such a good increase in speed. As for your Sphinx syntax,
using local instead of agent is correct, because the indexes are
local, not remote.

As for pulling this into Thinking Sphinx, I've been working on some
changes to allow multiple indexes per model, so with some
metaprogramming, it'd be easy to create four indexes, each filtered to
a quarter of the records. Hopefully these changes will go live within
the next week or so.

Cheers

--
Pat