SSTable index-file format documentation

Nadav Har'El

<nyh@scylladb.com>

unread,

Jun 28, 2016, 5:01:10 PM6/28/16

to scylladb-dev

While we have some fairly detailed documentation on the Cassandra (2.*) sstable file format in the ScyllaDB wiki, one important piece that was still missing was the format of the index file.

So starting today, this piece is no longer missing. Check out:

https://github.com/scylladb/scylla/wiki/SSTables-Index-File

Importantly, this document also explains the "promoted index" feature, which allows skipping inside very long partitions (sstable rows), without reading an entire partition. Understanding this feature was necessary before we can fix issue https://github.com/scylladb/scylla/issues/959.

--
Nadav Har'El

n...@scylladb.com

Duarte Nunes

<duarte@scylladb.com>

unread,

Jun 28, 2016, 6:24:25 PM6/28/16

to Nadav Har'El, scylladb-dev

Very nice!

Avi Kivity

<avi@scylladb.com>

unread,

Jun 29, 2016, 12:52:00 AM6/29/16

to Nadav Har'El, scylladb-dev

Very disappointing (the format, not the docs, which are excellent).

In the worst case the index entry can be bigger than the block it is indexing! And in any case, in C*2.x, the cell names contain the full clustering key plus the clustering key column names plus the column name of the first column in the block (which, if it happens to be an unfrozen collection, will include the collection element key).

It should work well for partitions containing large blobs though, rather than many small cql rows.

I see in C*3.x they have an "offsets map", but it is still O(n). bsearch()ing on disk is possible, but very expensive, this is why btrees were invented.

--
You received this message because you are subscribed to the Google Groups "ScyllaDB development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scylladb-dev...@googlegroups.com.
To post to this group, send email to scylla...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/scylladb-dev/CANEVyjvfnscvZnBYgTFwa8NnmRQBXE%3DCf9eFno3YXB9Ft%3DCP8g%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Duarte Nunes

<duarte@scylladb.com>

unread,

Jun 29, 2016, 3:16:07 AM6/29/16

to Avi Kivity, Nadav Har'El, scylladb-dev

On 29/06/16 06:51, Avi Kivity wrote:

Very disappointing (the format, not the docs, which are excellent).

In the worst case the index entry can be bigger than the block it is indexing! And in any case, in C*2.x, the cell names contain the full clustering key plus the clustering key column names plus the column name of the first column in the block (which, if it happens to be an unfrozen collection, will include the collection element key).

It should work well for partitions containing large blobs though, rather than many small cql rows.

I see in C*3.x they have an "offsets map", but it is still O(n). bsearch()ing on disk is possible, but very expensive, this is why btrees were invented.

The format seems to try and avoid seeks, like it's optimizing for HDDs.

On 06/29/2016 12:01 AM, Nadav Har'El wrote:

While we have some fairly detailed documentation on the Cassandra (2.*) sstable file format in the ScyllaDB wiki, one important piece that was still missing was the format of the index file.

So starting today, this piece is no longer missing. Check out:

https://github.com/scylladb/scylla/wiki/SSTables-Index-File

Importantly, this document also explains the "promoted index" feature, which allows skipping inside very long partitions (sstable rows), without reading an entire partition. Understanding this feature was necessary before we can fix issue https://github.com/scylladb/scylla/issues/959.

--
Nadav Har'El

n...@scylladb.com

--
You received this message because you are subscribed to the Google Groups "ScyllaDB development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scylladb-dev...@googlegroups.com.
To post to this group, send email to scylla...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/scylladb-dev/CANEVyjvfnscvZnBYgTFwa8NnmRQBXE%3DCf9eFno3YXB9Ft%3DCP8g%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "ScyllaDB development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scylladb-dev...@googlegroups.com.
To post to this group, send email to scylla...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/scylladb-dev/1ecc268f-a5bd-4f73-54cc-b4ebe7a6f531%40scylladb.com.

Avi Kivity

<avi@scylladb.com>

unread,

Jun 29, 2016, 4:02:30 AM6/29/16

to Duarte Nunes, Nadav Har'El, scylladb-dev

On 06/29/2016 10:16 AM, Duarte Nunes wrote:
> On 29/06/16 06:51, Avi Kivity wrote:
>> Very disappointing (the format, not the docs, which are excellent).
>>
>> In the worst case the index entry can be bigger than the block it is
>> indexing! And in any case, in C*2.x, the cell names contain the full
>> clustering key plus the clustering key column names plus the column
>> name of the first column in the block (which, if it happens to be an
>> unfrozen collection, will include the collection element key).
>>
>> It should work well for partitions containing large blobs though,
>> rather than many small cql rows.
>>
>> I see in C*3.x they have an "offsets map", but it is still O(n).
>> bsearch()ing on disk is possible, but very expensive, this is why
>> btrees were invented.
> The format seems to try and avoid seeks, like it's optimizing for HDDs.

It's not avoiding seeks.

If you hold the offset map in memory, then you still have an O(n) data
structure in memory, which means that as your sstables grow they consume
more memory.

If you keep the offset map on disk, then you have to perform (log n) -
(log block_size/sizeof(offset)) seeks, compared to a btree which does
(log n) / (log block_size/sizeof(offset)) seeks.

Let's have n=1,000,000, block_size=8k, sizeof(offset) = 8; then the C*
index gives you 10 seeks, btree gives you 2 seeks.

Nadav Har'El

<nyh@scylladb.com>

unread,

Jun 29, 2016, 4:29:42 AM6/29/16

to Avi Kivity, Duarte Nunes, scylladb-dev

On Wed, Jun 29, 2016 at 11:02 AM, Avi Kivity <a...@scylladb.com> wrote:

On 06/29/2016 10:16 AM, Duarte Nunes wrote:

On 29/06/16 06:51, Avi Kivity wrote:

Very disappointing (the format, not the docs, which are excellent).

In the worst case the index entry can be bigger than the block it is indexing! And in any case, in C*2.x, the cell names contain the full clustering key plus the clustering key column names plus the column name of the first column in the block (which, if it happens to be an unfrozen collection, will include the collection element key).

It should work well for partitions containing large blobs though, rather than many small cql rows.

I see in C*3.x they have an "offsets map", but it is still O(n). bsearch()ing on disk is possible, but very expensive, this is why btrees were invented.

The format seems to try and avoid seeks, like it's optimizing for HDDs.

It's not avoiding seeks.

The Cassandra 2 "promoted index" format (which is that one that I described) was indeed about minimizing the number of seeks seeks:

To find a partition Cassandra always needed two seeks (one in the index file and then one in the data file), and the "promoted index" ensures that the same two seeks are enough to also find something in the middle of a long partition: You still do one seek in the index file, but then read a potentially much longer chunk. On old HHDs, the cost of the seek was much more serious than whether you end up reading 4 KB or 100 KB from that place.

Of course, because the "promoted index" size is now O(n), when the partition length continues to grow towards the absurdly large, the index size continues to grow to the point where just reading the whole chunk of the index (and parsing it in an extremely inefficient manner in Java into a million tiny IndexInfo objects) would be slower than the seek. So their solution was never perfect. But if we want to be compatible with Cassandra 2.* sstables, this is what we need to implement...

Duarte Nunes

<duarte@scylladb.com>

unread,

Jun 29, 2016, 4:55:31 AM6/29/16

to Avi Kivity, Nadav Har'El, scylladb-dev

Sorry, I meant the 2.x format. I don't have an excuse for the offsets
map. Maybe code reuse, smaller change?

Benedict Elliott Smith

<_@belliottsmith.com>

unread,

Jun 29, 2016, 6:02:11 AM6/29/16

to Duarte Nunes, Avi Kivity, Nadav Har'El, scylladb-dev

I believe this arose from the primitive nature of the caches, and the high costs associated with reading them in memory only. This is one of many Java-heap related things.

I would not try too hard to understand why origin does what it does algorithmically with its formats; they haven't been materially revisited since Cassandra was first created, and can be improved upon without a great deal of effort. Although improving on them significantly isn't a simple topic, as guaranteeing memory bounds for construction of a more efficient index isn't trivial, and so compaction could be adversely affected.

Some of my prior ideas about this can be found on CASSANDRA-7447, and specifically in this pdf although while some of the ideas remain good I think it is still much too limited in scope.

I haven't been following it too closely, but you can expect another index format change to support in this regard, that may possibly improve behaviour a little: CASSANDRA-9754

--
You received this message because you are subscribed to the Google Groups "ScyllaDB development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scylladb-dev...@googlegroups.com.
To post to this group, send email to scylla...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/scylladb-dev/1b725cae-00a6-02b9-8a3c-6c4307b35f62%40scylladb.com.

Avi Kivity

<avi@scylladb.com>

unread,

Jun 29, 2016, 6:10:22 AM6/29/16

to Duarte Nunes, Nadav Har'El, scylladb-dev

I don't see how the 2.x format avoids seeks. Sure, the index is in one
contiguous blob, and you read it with a single seek. But you avoided
the seek by reading the entire blob! Eventually the blob gets so large
reading it sequentially is more expensive than seeking around.

Consider a disk with 100 MB/s throughput and 10 ms latency.

Multiply these two numbers you get 1 MB. That's the "critical size".
Reading much more than this is more expensive than seeking, so if your
index entry is 10MB you would have been better off with two reads of 8k
btree blocks rather than the single 10MB read.

Reply all

Reply to author

Forward