Hi All,
I am trying to understand the ScyllaDB or C* SSTable format. Specifically, I am trying to understand how ScyllaDB or Cassandra stores data in an SSTable, and what it scans if I read one column of one partition.
It looks like OnDiskAtom stores rows, where a row's value is a list of atoms, each of which is usually a cell (a column name and value).
So if I run select column from table where partitionKey="foo", do I have to read the entire partition or not? (I am not looking for optimizations it can or will do, but rather just trying to understand how it works by default.)
I am just trying to understand how the actual data is laid out on disk (ignore the index file that helps to get to the right SSTable).
Thanks!
--
You received this message because you are subscribed to the Google Groups "ScyllaDB users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scylladb-user...@googlegroups.com.
To post to this group, send email to scyllad...@googlegroups.com.
Visit this group at https://groups.google.com/group/scylladb-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/scylladb-users/9e906a33-ee7a-479c-8669-3d8b2aafe2e4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
On Wed, May 10, 2017 at 10:00 AM, kant kodali <kant...@gmail.com> wrote:
I am trying to understand Scylladb or C* SStable format. Specifically I am trying to understand how ScyllaDB or Cassandra stores data on sstable
We have several Wiki pages about this topic:
Each sstable is composed of several files: the data file contains the actual data; the index file contains (to make a long story short) a list of partition keys and their location in the data file; the summary file allows finding partitions in the index file; the compression info file holds the metadata needed to read the compressed data file; and there are a few more. We describe those in detail in:
https://github.com/scylladb/scylla/wiki/SSTables-Data-File
https://github.com/scylladb/scylla/wiki/SSTables-Index-File
https://github.com/scylladb/scylla/wiki/SSTables-Summary-File
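As a rough illustration of how these files cooperate on a read, here is a minimal Python sketch. It models the data, index and summary files as in-memory structures rather than parsing the real on-disk formats, and every name in it (`Sstable`, `summary_interval`, `find_partition`) is hypothetical. It also orders partitions by plain key for simplicity, which is an assumption of the sketch.

```python
from bisect import bisect_right

class Sstable:
    """Toy in-memory model of the data, index and summary files (not the real format)."""

    def __init__(self, partitions, summary_interval=2):
        # partitions: dict mapping partition key -> bytes payload.
        # Assumption: partitions are ordered by plain key in this sketch.
        self.data = b""
        self.index = []  # (key, offset in data, length), sorted by key
        for key in sorted(partitions):
            payload = partitions[key]
            self.index.append((key, len(self.data), len(payload)))
            self.data += payload
        # The summary samples every Nth index entry: (key, position in index).
        self.summary_interval = summary_interval
        self.summary = [
            (self.index[i][0], i)
            for i in range(0, len(self.index), summary_interval)
        ]

    def find_partition(self, key):
        # Step 1: binary-search the summary for the last sampled key <= key.
        sampled_keys = [k for k, _ in self.summary]
        s = bisect_right(sampled_keys, key) - 1
        if s < 0:
            return None
        start = self.summary[s][1]
        # Step 2: scan only the small index range covered by that sample.
        for k, offset, length in self.index[start:start + self.summary_interval]:
            if k == key:
                # Step 3: read exactly this partition from the data file.
                return self.data[offset:offset + length]
        return None

sst = Sstable({"foo": b"FOO-DATA", "bar": b"BAR", "baz": b"BAZ!", "qux": b"Q"})
print(sst.find_partition("foo"))
```

The point of the sketch is only the division of labour: the summary narrows the search to a small slice of the index, and the index gives the byte range to read from the data file.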
The data file format is based on historical Cassandra concepts that predate the advent of CQL or clustering keys, so you need to understand how modern concepts like clustering keys, static rows and containers translate to concepts like "cells" in the sstable data file. This is explained here:
https://github.com/scylladb/scylla/wiki/SSTables-interpretation-in-Urchin
(The name of the document is amusingly out of date :-).)
Nadav.
On Wed, May 10, 2017 at 11:28 AM, Nadav Har'El <n...@scylladb.com> wrote:
We describe those in detail in
https://github.com/scylladb/scylla/wiki/SSTables-Data-File
https://github.com/scylladb/scylla/wiki/SSTables-Index-File
https://github.com/scylladb/scylla/wiki/SSTables-Summary-File
Nadav, FYI: we cloned the Wiki to http://docs.scylladb.com/kb/:
- SSTable Compression: deep dive into Scylla/Cassandra SSTable compression
- SSTable Data File: deep dive into the Scylla/Cassandra SSTable format
- SSTable format in Scylla: Scylla SSTables are compatible with Cassandra 2.1.8, but why are there more of them?
- SSTable Interpretation: deep dive into Scylla/Cassandra SSTable interpretation in Scylla
- SSTable Summary File: deep dive into the Scylla/Cassandra SSTable summary file format
We should have the Wiki point to the doc site, so we do not maintain two copies.
On 05/10/2017 10:00 AM, kant kodali wrote:
Hi All,
I am trying to understand Scylladb or C* SStable format. Specifically I am trying to understand how ScyllaDB or Cassandra stores data on sstable and what does it scan if I were read one partition and one column?
It looks like OnDiskAtom stores rows where A row's value is a list of atoms, each of which is usually a cell (a column name and value)
That's correct.
Did you take a look at https://github.com/scylladb/scylla/wiki/SSTables-Data-File? It explains the format in great detail.
so if I were to do select column from table where paritionKey="foo" I have to read the entire partition or no? (Not looking for an optimizations it can or will do but rather just trying to understand how it works by default)
Well, without the index you have to read the file from beginning to end, because you don't know where the partition starts.
The index allows you to locate the beginning of a partition, and when a "promoted index" is available (it's part of Index.db), it allows locating a specific column in the partition to within 64kB.
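To illustrate the idea of the promoted index, here is a minimal Python sketch. It assumes each promoted-index entry records the first and last column name of one block (about 64 KB by default) together with the block's offset and width inside the partition. The names `IndexBlock` and `find_block` are hypothetical, and the binary search is just one plausible way to search the list once it is in memory, not a statement about what Scylla or Cassandra actually does.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class IndexBlock:
    first_name: str  # first column name stored in the block
    last_name: str   # last column name stored in the block
    offset: int      # byte offset of the block from the partition start
    width: int       # block length in bytes

def find_block(blocks: List[IndexBlock], column: str) -> Optional[IndexBlock]:
    """Return the block whose name range may contain `column`, or None."""
    # Blocks are sorted and non-overlapping, so their last names are sorted too.
    # Building this list is O(n); it stands in for loading the whole index.
    last_names = [b.last_name for b in blocks]
    i = bisect_left(last_names, column)  # first block with last_name >= column
    if i == len(blocks) or blocks[i].first_name > column:
        return None  # column falls after the last block or in a gap
    return blocks[i]

blocks = [
    IndexBlock("a", "f", 0, 65536),
    IndexBlock("g", "m", 65536, 65536),
    IndexBlock("p", "z", 131072, 65536),
]
hit = find_block(blocks, "h")
print(hit.offset, hit.width)
```

Once the matching block is known, only that block needs to be read from the data file and scanned for the column.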
On Wednesday, May 10, 2017 at 12:24:47 AM UTC-7, Avi Kivity wrote:
The index allows you to locate the beginning of a partition, and when a "promoted index" is available (it's part of Index.db), it allows locating a specific column in the partition to within 64kB.
Got it. So if the promoted index allows locating a specific column in the partition to within 64 KB, what happens if my partition is 2 GB? Does it potentially seek through 2 GB even if I want to select just one column? In other words, if the partition is of length n, is finding one column in a large partition an O(n) search or an O(log n) search?
I am just trying to understand how the actual data is laid out on the disk (ignore the index file that helps to get to the right SSTable)
The data is just a sorted list of partitions, each composed of a sorted list of cells and tombstones. It's not really useful for reads without the index, because you can't just seek randomly into the middle and start reading.
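A small Python sketch can show why a file of back-to-back, variable-length partitions cannot be searched without an index. The record encoding used here (a 2-byte key length, the key, a 4-byte payload length, the payload) is a toy format made up for this example, and `write_partitions` and `scan_for` are hypothetical names. The sketch also sorts by plain key, which is an assumption.

```python
import struct

def write_partitions(partitions):
    """Serialize (key, payload) pairs, sorted by key, as length-prefixed records."""
    out = bytearray()
    for key, payload in sorted(partitions):
        out += struct.pack(">H", len(key)) + key
        out += struct.pack(">I", len(payload)) + payload
    return bytes(out)

def scan_for(data, wanted):
    """Walk records from offset 0; return (payload or None, records visited)."""
    pos, visited = 0, 0
    while pos < len(data):
        (key_len,) = struct.unpack_from(">H", data, pos)
        key = data[pos + 2:pos + 2 + key_len]
        pos += 2 + key_len
        (payload_len,) = struct.unpack_from(">I", data, pos)
        pos += 4
        visited += 1
        if key == wanted:
            return data[pos:pos + payload_len], visited
        # Record boundaries are only known by reading each header in turn,
        # so there is no way to jump straight to the middle of the file.
        pos += payload_len
    return None, visited

data = write_partitions([(b"foo", b"x" * 10), (b"bar", b"yy"), (b"baz", b"zzz")])
print(scan_for(data, b"foo"))
```

With an index that stores each partition's offset, the loop above is replaced by a single seek.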
On Wed, May 10, 2017 at 5:51 PM, kant kodali <kant...@gmail.com> wrote:
Got it. so if promoted index allows locating a specific column in the partition to within 64kB then What happens if my partition is 2GB? does it potentially seek through 2GB even if I want to select just one column?
If the partition is 2 GB, and you only need to read one small column, the promoted index allows you to find the right 64 KB (by default) section of this partition, and only read this 64 KB (and not 2 GB).
However, the existing format (inherited from Cassandra) has a problem: as the partition grows, the "promoted index" grows. A 2 GB partition has about 30,000 of those 64 KB segments, so in the index file we have for each partition a long list of 30,000 segments, and each time we read from the partition we need to read this entire list.
For extremely long partitions (like 2 GB), this list can easily be larger than the 64 KB we will read from the data file.
Nadav.
Got it! This makes sense.
I am not sure what you mean by "this list can easily be larger than the 64 KB we will read from the data file". You just explained that the list size will be 2 GB / 64 KB ~ 30K entries, right? Or are you saying that column_index_size_in_kb can be increased above 64 KB for faster retrieval? I am trying to understand whether, to read a column in this 2 GB partition, I need to go through every block and compare its start and finish columns to see if the column I am looking for falls in that range, and if so, use that particular promoted index entry to get the offset of the column and read just that column. Is that essentially an O(n) operation?
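The arithmetic behind the remark about the size of the list can be sketched in a few lines of Python. The 64 KB block size and the 2 GB partition come from this thread; the 40-byte size of one promoted-index entry is purely an assumption for illustration, since the real size depends on the length of the column names stored in each entry. The function name `promoted_index_stats` is hypothetical.

```python
KB = 1024
GB = 1024 ** 3

def promoted_index_stats(partition_bytes, block_bytes=64 * KB, entry_bytes=40):
    """Estimate the number of promoted-index entries and their total size.

    entry_bytes is a hypothetical per-entry size, not a documented constant.
    """
    blocks = -(-partition_bytes // block_bytes)  # ceiling division
    return blocks, blocks * entry_bytes

blocks, index_bytes = promoted_index_stats(2 * GB)
print(blocks)                   # 32768 entries, i.e. "about 30,000"
print(index_bytes // KB, "KB")  # 1280 KB under the 40-byte assumption
```

Under that assumption the per-partition list is about 1.25 MB, roughly 20 times the 64 KB block that is eventually read from the data file, so the cost of reading the list grows linearly with the partition size even though only one block of data is fetched.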