What is currently the best approach to implement full text and geospatial searches?


turecki@gmail.com

<turecki@gmail.com>
Sep 9, 2017, 7:00:53 PM
to ScyllaDB users
I have almost decided to pick ScyllaDB for a startup project, as it ticks most boxes easily, and for its amazing performance.
My doubts are about scalable querying of geospatial data and/or full text search.
I know secondary indexes are planned in 2.1, but the scope of compatibility with Cassandra in this area bothers me slightly (using cassandra-lucene-index, for example).
It makes sense to keep such indexes in a separate, scalable cluster, but on the other hand, with Scylla's powerful shard-per-core processing,
such indexing could be performed locally by extra core(s) with thread communication on a single machine, and scaled/split accordingly, provided the internal architecture (promises/non-blocking) allows it.
What is currently the best way to implement such indexes (the most compatible/recommended solution), and what will change once the secondary indexes feature is added, or even further along the roadmap?

Alexander Sicular

<siculars@scylladb.com>
Sep 10, 2017, 12:25:40 AM
to scylladb-users@googlegroups.com
Hi turecki, 

thanks for the question. As you pointed out, Scylla scales on cores and, in fact, can be limited to certain cores on the instance [0]. Actually, Scylla likes to reserve CPU 0 for NIC IRQs [1][2][3]. That said, hypothetically, you could imagine a situation where you had a large multi-core instance and wanted to pin certain calculations, say geospatial ones, to certain cores. Those processes could run under the Scylla umbrella or live in some other process, like Lucene, with its own considerations. That could perhaps work, but geospatial/full-text search functionality is not in Scylla today. There is a guide on how to run Scylla in a shared environment [4] if you wanted to colocate Elastic/Solr on a Scylla instance (I wouldn't recommend it). Secondary indices (2i) will not help you with full text or geospatial searching.

In reality, the best way to do what you want today would be to run an additional Lucene-based cluster dedicated to geospatial/full-text functionality and synchronize data between the two in your own middleware layer. That said, there is a way to sort of hack geolocation-related features into a key/value-capable data-model system like Scylla/Cassandra/Riak/Redis, and that is geohashing [5][6]. Consider things like people, cars, and sensors moving through space/time and regularly registering their location as a geohash of a given precision. Your primary key would be that hash, at a minimum. One consideration would be write amplification across levels of geohash precision. Whether you can live with the limitations of such a system depends on your use case.
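To make the geohashing idea concrete, here is a minimal sketch of the standard base32 geohash encoding: nearby points share a common prefix, and that prefix can serve as the partition key. This is illustrative Python, not a Scylla API.

```python
# Minimal standard geohash encoder. Nearby coordinates share a hash prefix,
# so a prefix of chosen precision can act as the partition key.
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=7):
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    hashed, bits, ch, even = [], 0, 0, True  # geohash starts with longitude
    while len(hashed) < precision:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        ch <<= 1
        if val >= mid:
            ch |= 1
            rng[0] = mid
        else:
            rng[1] = mid
        even = not even
        bits += 1
        if bits == 5:  # five bits per base32 character
            hashed.append(_BASE32[ch])
            bits, ch = 0, 0
    return "".join(hashed)

print(geohash(57.64911, 10.40744))  # -> u4pruyd
```

A write at precision 7 implies the cell at precision 6, 5, etc., which is where the write-amplification trade-off mentioned above comes from if you index multiple precision levels.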

Hope this helps,




--
Alexander Sicular
Senior Solutions Architect, ScyllaDB
@siculars
sicu...@scylladb.com

Michal Turecki

<turecki@gmail.com>
Sep 13, 2017, 2:06:27 AM
to ScyllaDB users
Hi Alexander,

Thank you for a very helpful answer. After looking into it, it makes sense to run CPU-intensive full text search separately, but in my humble opinion geohash functionality is a prime candidate for inclusion in the Scylla core.
Looking at how it is implemented in Redis/Pedis [0], it operates on ranges of int64 values, so when implemented as a secondary key it should perform close to a primary key lookup. Having it in a separate cluster would mean (as for text search) that many primary key lookups are made based on search results, unless some workaround is implemented to cache frequently searched phrases (in my use case, geo + text search will be the primary way to find results).
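For reference, the Redis encoding mentioned here quantizes latitude and longitude to 26 bits each and bit-interleaves them into a single 52-bit integer, which fits comfortably in a CQL bigint. A hedged sketch of that idea (the function name is made up for illustration):

```python
# Redis-style interleaved geohash: quantize each coordinate to `bits` bits,
# then interleave them into one integer. A coarser cell is simply a
# contiguous range of these integers, so proximity queries become a small
# number of integer range scans.
def interleave(lat, lon, bits=26):
    # Quantize each coordinate into [0, 2^bits), clamping the upper edge.
    lat_q = min(int((lat + 90.0) / 180.0 * (1 << bits)), (1 << bits) - 1)
    lon_q = min(int((lon + 180.0) / 360.0 * (1 << bits)), (1 << bits) - 1)
    code = 0
    for i in range(bits):
        code |= ((lon_q >> i) & 1) << (2 * i)      # even bit positions: longitude
        code |= ((lat_q >> i) & 1) << (2 * i + 1)  # odd bit positions: latitude
    return code  # fits in 2 * bits = 52 bits
```

Because the result is an ordinary integer, range predicates on a clustering column could in principle cover a geohash cell, which is the basis of Michal's "close to a primary key lookup" expectation.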

You mentioned using a geohash as a primary key as a current workaround. This is really interesting, since I assume the value would contain references to all primary item keys at that location: very neat, considering the number of such keys will be reasonably low (it probably is in my case).
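The bucket-of-item-keys model described above can be sketched with a toy in-memory stand-in; in Scylla this would map to a table with the geohash prefix as partition key and the item id as clustering key (all names here are invented for illustration):

```python
# Toy model of "geohash prefix -> set of item keys". Registering a location
# writes the item id under the prefix; a cell lookup returns every item id
# registered in that cell.
from collections import defaultdict

buckets = defaultdict(set)  # geohash prefix -> item ids

def register(item_id, full_hash, precision=6):
    buckets[full_hash[:precision]].add(item_id)

def lookup(full_hash, precision=6):
    return buckets[full_hash[:precision]]

register("car-1", "u4pruyd")
register("car-2", "u4pruyc")  # same 6-character cell, different 7th character
print(sorted(lookup("u4pruyd")))  # -> ['car-1', 'car-2']
```

Items that move would also need a delete from the old cell's bucket, which this toy omits.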

Unless I find a shortcut, I will follow the Lucene-in-a-separate-cluster suggestion as a safe bet. Thank you for taking the time to answer, much appreciated.




ddorian43

<dorian.hoxha@gmail.com>
Sep 24, 2017, 4:18:23 PM
to ScyllaDB users
Secondary indexing is one thing; having a full Lucene clone as an index is another.

If you really want it in one cluster (fewer ops, or whatever), look into Elassandra (Cassandra + ES).

Per-core scaling on full-text search matters when you need Google/Twitter scale (e.g. when you're bottlenecked on bitset merging or document ordering/scoring), and even then a separate cluster will still do better. RediSearch will also handle that (though you need $ for a Redis Labs cluster).

I think you should really reevaluate how much of that performance you need, and choose your tools accordingly.