ScyllaDB use cases

Markus Tenghamn

<markus@tenghamn.com>

unread,

Aug 9, 2021, 5:18:25 PM8/9/21

to ScyllaDB users

I am new to ScyllaDB and it is being pushed by someone in a management position for me to implement and use. I have been able to get a cluster running successfully on Google Cloud and I can make queries without issue.

The trouble I have is finding a use case for ScyllaDB even if it looks like a very good solution to many problems. We have mostly structured data and it is important that it's not lost and it should always be there when queried if it has been inserted. Currently scylla is being used to handle authentication with data such as API keys which seems wrong to me but I am still learning about ScyllaDB. We are also using lots and lots of materialized views which I am not sure is so good either.

Here is an example table.

CREATE TABLE object_id (

uid text,

ag_uid text,

year int,

month int,

day int,

hour int,

minute int,

second int,

ag_uid_pg_uid text,

id text,

application_number text,

PRIMARY KEY ((uid, ag_uid, year, month), day, hour, minute, second, ag_uid_pg_uid, id)

My understanding is that this is not a great way to structure the primary key. I thought it would be optimal to have a column as a key? But I do not know enough about ScyllaDB to argue for or against it. To me it would seem that an RDBMS is more optimal if we want to query data on multiple columns, especially on structured data which this is.

Is there a good guide for when to go with ScyllaDB and when to go with a relational database?

What makes ScyllaDB different from other NoSQL solutions other than performance?

Avishai Ish Shalom

<avishai.ishshalom@scylladb.com>

unread,

Aug 10, 2021, 6:12:57 AM8/10/21

to ScyllaDB users

NoSQL is kind of an umbrella term, so the question "What makes ScyllaDB different from other NoSQL solutions" is kind of vague. Broadly speaking, Scylla belongs to a category of databases known as wide-columnar databases - but that's actually a slightly misleading name. Let me give a brief overview:
A central scaling strategy of distributed databases is sharding or partitioning data according to some "primary" partition key. This strategy is common to most NoSQL databases, and the primary difference in regards to sharding is whether data is broken down to ranges of consecutive keys - e.g. HBase, or sparse buckets using a modulo on the hash of the partition key - e.g. Scylla, Cassandra, DynamoDB, Mongo. The Reason for this difference is that it is very expensive (if not impractical) to do operations which involved disjoint shards/ranges. Wide-column databases add a secondary division of data - kind of an indexed list inside a hash table, or a hash table inside a hash table; the secondary key is known as a clustering key or ordering key and allows performing operations on a group of objects within a single shard/partition. For many applications, this compromise is good enough as long as you can model your data to have related objects within a single partition (without that partition growing too large). Of course, you can always ignore the secondary layer and use Scylla as a key-value store. As a clarifying comparison, MongoDB or Couchbase are key-value databases which do not have this secondary partition but they do not treat the value as opaque and can do some server-side processing on the value - which is known as a "document".
Due to the limits of partitioning/sharding/ranges, NoSQL databases in general require that you begin your data modeling from the queries: you are going to store data in a way that's optimal for the questions you are going to ask and not in a generic way. Thus you always start from the use case/application and work your way downwards to the database. Note that almost all applications can be modelled on top of Scylla, but there will be tradeoffs (there always are 🤷) - you may need to duplicate data in several different representations for example. Whether these tradeoffs make sense depends highly on your application and it's requirements. Generally speaking Scylla is a good fit for interactive applications which require low latency, deal with datasets larger than 0.5TB and have throughput >> 10k ops/sec, but people are using it for a variety of other things; it makes sense to have less databases to maintain even if that means having some uses that are not optimal.

Regarding your question on whether the schema makes sense - as you may have gathered from the above prose we would need the queries to know if the schema makes sense. You could definitely use a primary key like that and it would make sense if you wanted to scan on all ag_uid_pg_uid, id values for example; you may or may not want to mangle some of the columns to a single columns depending on how you use them.

Hope that helps a bit,

Avishai

Pekka Enberg

<penberg@scylladb.com>

unread,

Aug 10, 2021, 6:32:21 AM8/10/21

to ScyllaDB users, Markus Tenghamn, Avishai Ish Shalom

Hi Markus,

On Tue, Aug 10, 2021 at 12:18 AM Markus Tenghamn <mar...@tenghamn.com> wrote:
> Is there a good guide for when to go with ScyllaDB and when to go with a relational database?

I am not aware of such a guide, but I have found the following blog
post helpful to get started:

https://towardsdatascience.com/datastore-choices-sql-vs-nosql-database-ebec24d56106

The "cheat sheet" at the end of the post is particularly helpful.

Hope this helps!

Regards,

- Pekka

Reply all

Reply to author

Forward