How is SSTable laid out on the disk? and what parts of SSTable are scanned while issuing a query?

995 views
Skip to first unread message

kant kodali

<kanth909@gmail.com>
unread,
May 10, 2017, 3:00:46 AM5/10/17
to ScyllaDB users
Hi All,

I am trying to understand Scylladb or C* SStable format. Specifically I am trying to understand how ScyllaDB or Cassandra stores data on sstable and what does it scan if I were read one partition and one column?

It looks like OnDiskAtom stores rows where A row's value is a list of atoms, each of which is usually a cell (a column name and value)

so if I were to do select column from table where paritionKey="foo" I have to read the entire partition or no? (Not looking for an optimizations it can or will do but rather just trying to understand how it works by default) 

I am just trying to understand how the actual data is laid out on the disk (ignore the index file that helps to get to the right SSTable)

Thanks!

Avi Kivity

<avi@scylladb.com>
unread,
May 10, 2017, 3:24:47 AM5/10/17
to scylladb-users@googlegroups.com, kant kodali



On 05/10/2017 10:00 AM, kant kodali wrote:
Hi All,

I am trying to understand Scylladb or C* SStable format. Specifically I am trying to understand how ScyllaDB or Cassandra stores data on sstable and what does it scan if I were read one partition and one column?

It looks like OnDiskAtom stores rows where A row's value is a list of atoms, each of which is usually a cell (a column name and value)


That's correct.

Did you take a look at https://github.com/scylladb/scylla/wiki/SSTables-Data-File? It explains the format in great detail.


so if I were to do select column from table where paritionKey="foo" I have to read the entire partition or no? (Not looking for an optimizations it can or will do but rather just trying to understand how it works by default)

Well, without the index you have to read the file from beginning to end, because you don't know where the partition starts.

The index allows you to locate the beginning of a partition, and when a "promoted index" is available (it's part of Index.db), it allows locating a specific column in the partition to within 64kB.



I am just trying to understand how the actual data is laid out on the disk (ignore the index file that helps to get to the right SSTable)


The data is just a sorted list of partitions, composed of a sorted list of cells and tombstones. It's not really useful for reads without the index, because you can't just seek randomly in the middle and start reading.

Thanks!
--
You received this message because you are subscribed to the Google Groups "ScyllaDB users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scylladb-user...@googlegroups.com.
To post to this group, send email to scyllad...@googlegroups.com.
Visit this group at https://groups.google.com/group/scylladb-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/scylladb-users/9e906a33-ee7a-479c-8669-3d8b2aafe2e4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Nadav Har'El

<nyh@scylladb.com>
unread,
May 10, 2017, 4:28:47 AM5/10/17
to scylladb-users@googlegroups.com
On Wed, May 10, 2017 at 10:00 AM, kant kodali <kant...@gmail.com> wrote:
Hi All,

I am trying to understand Scylladb or C* SStable format. Specifically I am trying to understand how ScyllaDB or Cassandra stores data on sstable

We have several Wiki pages about this topic -

Each sstable is composed of several files - the data file contains the actual data; the index file contains (to make a long story short) a list of partition keys and their location in the data file; the summary file allows finding partitions in the index file; the compression file is used to compress the data file; and a few more. We describe those in detail in

https://github.com/scylladb/scylla/wiki/SSTables-Data-File
https://github.com/scylladb/scylla/wiki/SSTables-Index-File
https://github.com/scylladb/scylla/wiki/SSTables-Summary-File

The data file format is based on historical Cassandra concepts that predate the advent of CQL or clustering keys, so you need to understand how modern concepts like clustering keys, static rows and containers, translate to the concepts like like "cells" in the sstable data file. This is explained here:

https://github.com/scylladb/scylla/wiki/SSTables-interpretation-in-Urchin

(the name of the document is amusingly out of date :-)).

Nadav.

Tzach Livyatan

<tzach@scylladb.com>
unread,
May 10, 2017, 5:57:18 AM5/10/17
to ScyllaDB users
On Wed, May 10, 2017 at 11:28 AM, Nadav Har'El <n...@scylladb.com> wrote:

On Wed, May 10, 2017 at 10:00 AM, kant kodali <kant...@gmail.com> wrote:
Hi All,

I am trying to understand Scylladb or C* SStable format. Specifically I am trying to understand how ScyllaDB or Cassandra stores data on sstable

We have several Wiki pages about this topic -

Each sstable is composed of several files - the data file contains the actual data; the index file contains (to make a long story short) a list of partition keys and their location in the data file; the summary file allows finding partitions in the index file; the compression file is used to compress the data file; and a few more. We describe those in detail in

https://github.com/scylladb/scylla/wiki/SSTables-Data-File
https://github.com/scylladb/scylla/wiki/SSTables-Index-File
https://github.com/scylladb/scylla/wiki/SSTables-Summary-File

Nadav, FYI

We should have the Wiki point to the doc site, so we do not maintain two copies. 



The data file format is based on historical Cassandra concepts that predate the advent of CQL or clustering keys, so you need to understand how modern concepts like clustering keys, static rows and containers, translate to the concepts like like "cells" in the sstable data file. This is explained here:

https://github.com/scylladb/scylla/wiki/SSTables-interpretation-in-Urchin

(the name of the document is amusingly out of date :-)).

Nadav.

--
You received this message because you are subscribed to the Google Groups "ScyllaDB users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scylladb-users+unsubscribe@googlegroups.com.
To post to this group, send email to scylladb-users@googlegroups.com.

Nadav Har'El

<nyh@scylladb.com>
unread,
May 10, 2017, 6:45:11 AM5/10/17
to scylladb-users@googlegroups.com
On Wed, May 10, 2017 at 12:57 PM, Tzach Livyatan <tz...@scylladb.com> wrote:

On Wed, May 10, 2017 at 11:28 AM, Nadav Har'El <n...@scylladb.com> wrote:

On Wed, May 10, 2017 at 10:00 AM, kant kodali <kant...@gmail.com> wrote:
Hi All,

I am trying to understand Scylladb or C* SStable format. Specifically I am trying to understand how ScyllaDB or Cassandra stores data on sstable

We have several Wiki pages about this topic -

Each sstable is composed of several files - the data file contains the actual data; the index file contains (to make a long story short) a list of partition keys and their location in the data file; the summary file allows finding partitions in the index file; the compression file is used to compress the data file; and a few more. We describe those in detail in

https://github.com/scylladb/scylla/wiki/SSTables-Data-File
https://github.com/scylladb/scylla/wiki/SSTables-Index-File
https://github.com/scylladb/scylla/wiki/SSTables-Summary-File

Nadav, FYI

We should have the Wiki point to the doc site, so we do not maintain two copies. 

I have no idea, though, how to update this "knowledge base". In particular, I see now the "Index file" (see https://github.com/scylladb/scylla/wiki/SSTables-Index-File) is missing in the knowledge base, and the "summary file" is also an old version of what is now in the Wiki.

Unless someone plans to edit this knowledge base continuously - or you teach me (and the others) how to do it (does it come from some git repository? Do I have the writes to push to it? Or what?), I think the right thing would be to remove the outdated knowledge base pages, not the wiki.... 

Nadav.

kant kodali

<kanth909@gmail.com>
unread,
May 10, 2017, 10:51:01 AM5/10/17
to ScyllaDB users, kanth909@gmail.com

On Wednesday, May 10, 2017 at 12:24:47 AM UTC-7, Avi Kivity wrote:



On 05/10/2017 10:00 AM, kant kodali wrote:
Hi All,

I am trying to understand Scylladb or C* SStable format. Specifically I am trying to understand how ScyllaDB or Cassandra stores data on sstable and what does it scan if I were read one partition and one column?

It looks like OnDiskAtom stores rows where A row's value is a list of atoms, each of which is usually a cell (a column name and value)


That's correct.

Did you take a look at https://github.com/scylladb/scylla/wiki/SSTables-Data-File? It explains the format in great detail.

so if I were to do select column from table where paritionKey="foo" I have to read the entire partition or no? (Not looking for an optimizations it can or will do but rather just trying to understand how it works by default)

Well, without the index you have to read the file from beginning to end, because you don't know where the partition starts.

The index allows you to locate the beginning of a partition, and when a "promoted index" is available (it's part of Index.db), it allows locating a specific column in the partition to within 64kB.

   Got it. so if promoted index allows locating a specific column in the partition to within 64kB then What happens if my partition is 2GB? does it potentially seek through 2GB even if I want to select just one column?

kant kodali

<kanth909@gmail.com>
unread,
May 10, 2017, 11:01:43 AM5/10/17
to ScyllaDB users, kanth909@gmail.com


On Wednesday, May 10, 2017 at 7:51:01 AM UTC-7, kant kodali wrote:

On Wednesday, May 10, 2017 at 12:24:47 AM UTC-7, Avi Kivity wrote:



On 05/10/2017 10:00 AM, kant kodali wrote:
Hi All,

I am trying to understand Scylladb or C* SStable format. Specifically I am trying to understand how ScyllaDB or Cassandra stores data on sstable and what does it scan if I were read one partition and one column?

It looks like OnDiskAtom stores rows where A row's value is a list of atoms, each of which is usually a cell (a column name and value)


That's correct.

Did you take a look at https://github.com/scylladb/scylla/wiki/SSTables-Data-File? It explains the format in great detail.

so if I were to do select column from table where paritionKey="foo" I have to read the entire partition or no? (Not looking for an optimizations it can or will do but rather just trying to understand how it works by default)

Well, without the index you have to read the file from beginning to end, because you don't know where the partition starts.

The index allows you to locate the beginning of a partition, and when a "promoted index" is available (it's part of Index.db), it allows locating a specific column in the partition to within 64kB.

   Got it. so if promoted index allows locating a specific column in the partition to within 64kB then What happens if my partition is 2GB? does it potentially seek through 2GB even if I want to select just one column? other words, if the partition is of length n and to find one column in a large partition is it a O(n) search or O(log n) search ? 

Avi Kivity

<avi@scylladb.com>
unread,
May 10, 2017, 11:05:33 AM5/10/17
to scylladb-users@googlegroups.com, kant kodali



On 05/10/2017 06:01 PM, kant kodali wrote:


On Wednesday, May 10, 2017 at 7:51:01 AM UTC-7, kant kodali wrote:

On Wednesday, May 10, 2017 at 12:24:47 AM UTC-7, Avi Kivity wrote:



On 05/10/2017 10:00 AM, kant kodali wrote:
Hi All,

I am trying to understand Scylladb or C* SStable format. Specifically I am trying to understand how ScyllaDB or Cassandra stores data on sstable and what does it scan if I were read one partition and one column?

It looks like OnDiskAtom stores rows where A row's value is a list of atoms, each of which is usually a cell (a column name and value)


That's correct.

Did you take a look at https://github.com/scylladb/scylla/wiki/SSTables-Data-File? It explains the format in great detail.

so if I were to do select column from table where paritionKey="foo" I have to read the entire partition or no? (Not looking for an optimizations it can or will do but rather just trying to understand how it works by default)

Well, without the index you have to read the file from beginning to end, because you don't know where the partition starts.

The index allows you to locate the beginning of a partition, and when a "promoted index" is available (it's part of Index.db), it allows locating a specific column in the partition to within 64kB.

   Got it. so if promoted index allows locating a specific column in the partition to within 64kB then What happens if my partition is 2GB? does it potentially seek through 2GB even if I want to select just one column? other words, if the partition is of length n and to find one column in a large partition is it a O(n) search or O(log n) search ?


It's still O(n) unfortunately (the index is not organized as a btree) but a much smaller n.  If your rows are tiny and your partitions very large you well spend a lot of I/O and cpu scanning the index.




I am just trying to understand how the actual data is laid out on the disk (ignore the index file that helps to get to the right SSTable)


The data is just a sorted list of partitions, composed of a sorted list of cells and tombstones. It's not really useful for reads without the index, because you can't just seek randomly in the middle and start reading.

Thanks!
--
You received this message because you are subscribed to the Google Groups "ScyllaDB users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scylladb-user...@googlegroups.com.
To post to this group, send email to scyllad...@googlegroups.com.
Visit this group at https://groups.google.com/group/scylladb-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/scylladb-users/9e906a33-ee7a-479c-8669-3d8b2aafe2e4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
You received this message because you are subscribed to the Google Groups "ScyllaDB users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scylladb-user...@googlegroups.com.
To post to this group, send email to scyllad...@googlegroups.com.
Visit this group at https://groups.google.com/group/scylladb-users.

Nadav Har'El

<nyh@scylladb.com>
unread,
May 10, 2017, 11:07:27 AM5/10/17
to scylladb-users@googlegroups.com, kanth909@gmail.com
On Wed, May 10, 2017 at 5:51 PM, kant kodali <kant...@gmail.com> wrote:


The index allows you to locate the beginning of a partition, and when a "promoted index" is available (it's part of Index.db), it allows locating a specific column in the partition to within 64kB.

   Got it. so if promoted index allows locating a specific column in the partition to within 64kB then What happens if my partition is 2GB? does it potentially seek through 2GB even if I want to select just one column?


If the partition is 2GB, and you only need to read one small column, the promoted index allows you to find the right 64 KB (by default) section of this partition, and only read this 64 KB (and not 2GB).

However, the existing format (inherited from Cassandra) has a problem that as the partition grows, the "promoted index" grows - a 2 GB partition has about 30,000 of those 64 KB segments, so in the index file we have for each partition a long list of 30,000 segments - and each time we read from the partition we need to read this entire list. For extremely long partitions (like 2GB), this list can easily be larger than the 64 KB we will read from the data file.

Nadav.

kant kodali

<kanth909@gmail.com>
unread,
May 10, 2017, 11:57:35 AM5/10/17
to ScyllaDB users, kanth909@gmail.com


On Wednesday, May 10, 2017 at 8:07:27 AM UTC-7, Nadav Har'El wrote:

On Wed, May 10, 2017 at 5:51 PM, kant kodali <kant...@gmail.com> wrote:


The index allows you to locate the beginning of a partition, and when a "promoted index" is available (it's part of Index.db), it allows locating a specific column in the partition to within 64kB.

   Got it. so if promoted index allows locating a specific column in the partition to within 64kB then What happens if my partition is 2GB? does it potentially seek through 2GB even if I want to select just one column?


If the partition is 2GB, and you only need to read one small column, the promoted index allows you to find the right 64 KB (by default) section of this partition, and only read this 64 KB (and not 2GB).

However, the existing format (inherited from Cassandra) has a problem that as the partition grows, the "promoted index" grows - a 2 GB partition has about 30,000 of those 64 KB segments, so in the index file we have for each partition a long list of 30,000 segments - and each time we read from the partition we need to read this entire list.
  
    Got  it! This make sense. 

 
For extremely long partitions (like 2GB), this list can easily be larger than the 64 KB we will read from the data file. I am not sure wha you mean by this? You just explained the list size will be 2GB/64KB ~ 30K right? or you saying that column_index_size_in_kb can be increased greater than 64KB for faster retrieval?

Nadav.

kant kodali

<kanth909@gmail.com>
unread,
May 10, 2017, 12:16:29 PM5/10/17
to ScyllaDB users, kanth909@gmail.com
For extremely long partitions (like 2GB), this list can easily be larger than the 64 KB we will read from the data file. I am not sure wha you mean by this? You just explained the list size will be 2GB/64KB ~ 30K right? or you saying that column_index_size_in_kb can be increased greater than 64KB for faster retrieval? I am trying to understand if I were to read a column in this 2GB partition do I need to go through every block and compare the start and finish column to see if the column that I am looking for falls under that range and if so, I would use that particular promoted index to get the offset of the column I am looking for and just read that column. Essentially a O(n) operation ?

Nadav.

Nadav Har'El

<nyh@scylladb.com>
unread,
May 11, 2017, 8:31:34 AM5/11/17
to scylladb-users@googlegroups.com, kanth909@gmail.com

--
Nadav Har'El
n...@scylladb.com

On Wed, May 10, 2017 at 7:16 PM, kant kodali <kant...@gmail.com> wrote:
If the partition is 2GB, and you only need to read one small column, the promoted index allows you to find the right 64 KB (by default) section of this partition, and only read this 64 KB (and not 2GB).

However, the existing format (inherited from Cassandra) has a problem that as the partition grows, the "promoted index" grows - a 2 GB partition has about 30,000 of those 64 KB segments, so in the index file we have for each partition a long list of 30,000 segments - and each time we read from the partition we need to read this entire list.
  
    Got  it! This make sense. 

 
For extremely long partitions (like 2GB), this list can easily be larger than the 64 KB we will read from the data file. I am not sure wha you mean by this? You just explained the list size will be 2GB/64KB ~ 30K right? or you saying that column_index_size_in_kb can be increased greater than 64KB for faster retrieval? I am trying to understand if I were to read a column in this 2GB partition do I need to go through every block and compare the start and finish column to see if the column that I am looking for falls under that range and if so, I would use that particular promoted index to get the offset of the column I am looking for and just read that column. Essentially a O(n) operation ?

I explained that the list for one partition will have 30,000 items, but they will not be one byte each, so the on-disk list (in the "promoted index" in the index file, please refer to the wiki I pointed to) will be significantly more than 30 KB. And as Avi said, it is O(n) - i.e., if a 2 GB partition will have X bytes in the index file, a 4 GB partition will have 2*X bytes. At some point when the partition is long enough, we need to read more from the index file than we need to read from the data file. This is a deficiency of this sstable format we inherited (for backward compatibility reasons) from Cassandra.

And yes, we do use a linear search to find the block containing the column you're after. *Had* we cached the promoted index in memory, we could save it as a binary tree and get O(log N) search - see https://github.com/scylladb/scylla/issues/1469.

And yes, you can increase column_index_size_in_kb when you know you have super-huge partitions and want to have to read less data from the index - at the expense of reading more data from the data file.
An even more promising direction of reducing the amount of data read from the index file is to reduce the *sampling rate* (usually we read 128 index entries every time. when you have huge partitions, it makes sense to read fewer - even just one). https://github.com/scylladb/scylla/issues/1842 is about doing that automatically.
Reply all
Reply to author
Forward
0 new messages