[PATCH v3 2/3] sstables: coroutinize sstable::load()


Laszlo Ersek <laszlo.ersek@scylladb.com>
Sep 20, 2024, 2:01:03 PM
to scylladb-dev@googlegroups.com, Raphael S. Carvalho
Best viewed with "git show -b -W".

Cc: "Raphael S. Carvalho" <raph...@scylladb.com>
Ref: https://github.com/scylladb/scylladb/issues/18191
Signed-off-by: Laszlo Ersek <laszlo...@scylladb.com>
---

Notes:
v2:

- rather than unnesting the future chain, rewrite it as a coroutine
[Avi]

sstables/sstables.cc | 26 +++++++++-----------
1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/sstables/sstables.cc b/sstables/sstables.cc
index 2fbe81946845..d265586af7b4 100644
--- a/sstables/sstables.cc
+++ b/sstables/sstables.cc
@@ -1592,20 +1592,18 @@ future<> sstable::load(const dht::sharder& sharder, sstable_open_config cfg) noe

future<> sstable::load(sstables::foreign_sstable_open_info info) noexcept {
static_assert(std::is_nothrow_move_constructible_v<sstables::foreign_sstable_open_info>);
- return read_toc().then([this, info = std::move(info)] () mutable {
- _components = std::move(info.components);
- _data_file = make_checked_file(_read_error_handler, info.data.to_file());
- _index_file = make_checked_file(_read_error_handler, info.index.to_file());
- _shards = std::move(info.owners);
- _metadata_size_on_disk = info.metadata_size_on_disk;
- validate_min_max_metadata();
- validate_max_local_deletion_time();
- validate_partitioner();
- return update_info_for_opened_data().then([this]() {
- _total_reclaimable_memory.reset();
- _manager.increment_total_reclaimable_memory_and_maybe_reclaim(this);
- });
- });
+ co_await read_toc();
+ _components = std::move(info.components);
+ _data_file = make_checked_file(_read_error_handler, info.data.to_file());
+ _index_file = make_checked_file(_read_error_handler, info.index.to_file());
+ _shards = std::move(info.owners);
+ _metadata_size_on_disk = info.metadata_size_on_disk;
+ validate_min_max_metadata();
+ validate_max_local_deletion_time();
+ validate_partitioner();
+ co_await update_info_for_opened_data();
+ _total_reclaimable_memory.reset();
+ _manager.increment_total_reclaimable_memory_and_maybe_reclaim(this);
}

future<foreign_sstable_open_info> sstable::get_open_info() & {

Laszlo Ersek <laszlo.ersek@scylladb.com>
Sep 20, 2024, 2:01:03 PM
to scylladb-dev@googlegroups.com, Raphael S. Carvalho
Ref: https://github.com/scylladb/scylladb/issues/18191
Repo: https://github.com/lersek/scylla.git
Branch: gh-18191-v3 (b26dccf36e23)
Previous: gh-18191-v2 (b8e64d2713a7)
gh-18191-v1 (f24970209fa5)
CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/11787/

New feature-let, shouldn't be backported.

See the v3 updates in the Notes section of each patch.

Example:

- run scylla with

build/Dev/scylla --overprovisioned --developer-mode=yes --memory=2G --smp=3

- after scylla settles:

find $SCYLLA_HOME/data/ -type f -printf '%s\n' \
| (TOTAL=0; while read SIZE; do TOTAL=$((TOTAL+SIZE)); done; echo $TOTAL)

905720
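
The same total can be computed with standard awk:

find $SCYLLA_HOME/data/ -type f -printf '%s\n' \
| awk '{ t += $1 } END { print t }'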

- output from <http://localhost:9180/metrics>:

# HELP scylla_storage_bytes_occupied Bytes occupied by sstables
# TYPE scylla_storage_bytes_occupied gauge
scylla_storage_bytes_occupied{shard="0",storage_class="filesystem"} 520570.000000
scylla_storage_bytes_occupied{shard="0",storage_class="s3"} 0.000000
scylla_storage_bytes_occupied{shard="1",storage_class="filesystem"} 167833.000000
scylla_storage_bytes_occupied{shard="1",storage_class="s3"} 0.000000
scylla_storage_bytes_occupied{shard="2",storage_class="filesystem"} 217317.000000
scylla_storage_bytes_occupied{shard="2",storage_class="s3"} 0.000000

- 905720 = 520570 + 167833 + 217317

Cc: "Raphael S. Carvalho" <raph...@scylladb.com>

Thanks
Laszlo

Laszlo Ersek (3):
sstables/storage: add funcs to track used space per storage type per
shard
sstables: coroutinize sstable::load()
sstables: track used space per storage type per shard

main.cc | 1 +
sstables/sstables.cc | 52 ++++++++++++++------
sstables/sstables.hh | 2 +
sstables/storage.cc | 25 +++++++++-
sstables/storage.hh | 11 +++++
5 files changed, 75 insertions(+), 16 deletions(-)


base-commit: 61d19e4464c2d36dcebb7c50aec95fa863bfb207

Laszlo Ersek <laszlo.ersek@scylladb.com>
Sep 20, 2024, 2:01:04 PM
to scylladb-dev@googlegroups.com, Raphael S. Carvalho
Upon scylla startup, I have identified the following two sstable method
call chains:

- discovering an existing sstable (potentially deleting it in the end):

sstable ctor -> load -> close_files -> [unlink]

- creating a new sstable (potentially deleting it in the end):

sstable ctor -> open_sstable -> seal_sstable -> close_files -> [unlink]

On the sstable "read path", credit the storage in load() and debit it in
unlink(). On the "write path", credit the storage in seal_sstable() and
debit it in unlink().

Define the current storage usage of an sstable on this shard as:

- zero, if the sstable is shared, or unshared but foreign,

- the sum of its recognized components' sizes, otherwise.

In the latter case, "_recognized_components" is populated

- on the read path

- in the 1st load variant via: load -> load_metadata -> read_toc,

- in the 2nd load variant via: load -> read_toc,

- on the write path via: open_sstable -> generate_toc.

I can't use the existing bytes_on_disk() function for this purpose.
Namely, since at least commit 2dbae856f84e ("sstable: Piggyback on sstable
parser and writer to provide bytes_on_disk", 2023-04-27), bytes_on_disk()
insists on "_data_file_size" and "_index_file_size" having been set (to
nonzero values). However, at least "_data_file_size" may not be set when
we call bytes_on_disk() from seal_sstable(), for example.

Instead, stat each recognized component afresh whenever the sstable's
storage footprint is requested for updating the metric.

The loop in the new bytes_on_disk_uncached() function is similar to the
loop removed in commit 2dbae856f84e. One difference is that the new one is
not specific to "filesystem_storage", i.e., the reactor's file_stat()
function. Instead, open each component using the sstable's appropriate
storage backend, and then call the storage-specific file::stat() member
for fetching the size. For S3, this appears to be satisfied by
s3::client::readable_file::stat().

Cc: "Raphael S. Carvalho" <raph...@scylladb.com>

Notes:
v3:

- track used space per storage type per shard [Avi]

- rename bytes_on_disk_uncached() to bytes_on_storage_uncached()

- make bytes_on_storage_uncached() shard-aware: return zero bytes for
shared and foreign sstables

- refresh commit message

v2:

- replace "co_await ... finally" with "co_await with_closeable" [Avi]

sstables/sstables.hh | 2 ++
sstables/sstables.cc | 26 ++++++++++++++++++++
2 files changed, 28 insertions(+)

diff --git a/sstables/sstables.hh b/sstables/sstables.hh
index 0b18acd2b637..fb8824eabef9 100644
--- a/sstables/sstables.hh
+++ b/sstables/sstables.hh
@@ -710,6 +710,8 @@ class sstable : public enable_lw_shared_from_this<sstable> {
// Reload components from which memory was previously reclaimed
future<> reload_reclaimed_components();

+ future<uint64_t> bytes_on_storage_uncached() const noexcept;
+
public:
// Finds first position_in_partition in a given partition.
// If reversed is false, then the first position is actually the first row (can be the static one).
diff --git a/sstables/sstables.cc b/sstables/sstables.cc
index d265586af7b4..f404d66b3f6b 100644
--- a/sstables/sstables.cc
+++ b/sstables/sstables.cc
@@ -1559,6 +1559,27 @@ future<> sstable::reload_reclaimed_components() {
sstlog.info("Reloaded bloom filter of {}", get_filename());
}

+future<uint64_t> sstable::bytes_on_storage_uncached() const noexcept {
+ uint64_t bytes = 0;
+
+ if (_shards.size() != 1 || _shards[0] != this_shard_id()) {
+ co_return bytes;
+ }
+
+ co_await coroutine::parallel_for_each(_recognized_components, [this, &bytes] (const component_type& type) -> future<> {
+ try {
+ file f = co_await _storage->open_component(*this, type, open_flags::ro, file_open_options{}, false);
+ struct stat st = co_await with_closeable(std::move(f), [] (file& f) {
+ return f.stat();
+ });
+ bytes += st.st_size;
+ } catch (...) {
+ sstlog.warn("failed to get size of {} for storage usage metric: {}", filename(type), std::current_exception());
+ }
+ });
+ co_return bytes;
+}
+
future<> sstable::load_metadata(sstable_open_config cfg, bool validate) noexcept {
co_await read_toc();
// read scylla-meta after toc. Might need it to parse
@@ -1588,6 +1609,7 @@ future<> sstable::load(const dht::sharder& sharder, sstable_open_config cfg) noe
std::vector<unsigned>{this_shard_id()} : compute_shards_for_this_sstable(sharder);
}
co_await open_data(cfg);
+ _storage->credit_bytes(co_await bytes_on_storage_uncached());


}

future<> sstable::load(sstables::foreign_sstable_open_info info) noexcept {

@@ -1602,6 +1624,7 @@ future<> sstable::load(sstables::foreign_sstable_open_info info) noexcept {
validate_max_local_deletion_time();
validate_partitioner();
co_await update_info_for_opened_data();
+ _storage->credit_bytes(co_await bytes_on_storage_uncached());
_total_reclaimable_memory.reset();
_manager.increment_total_reclaimable_memory_and_maybe_reclaim(this);
}
@@ -1902,6 +1925,7 @@ bool sstable::may_contain_rows(const query::clustering_row_ranges& ranges) const
future<> sstable::seal_sstable(bool backup)
{
co_await _storage->seal(*this);
+ _storage->credit_bytes(co_await bytes_on_storage_uncached());
if (_marked_for_deletion == mark_for_deletion::implicit) {
_marked_for_deletion = mark_for_deletion::none;
}
@@ -3053,6 +3077,7 @@ future<>
sstable::unlink(storage::sync_dir sync) noexcept {
_on_delete(*this);

+ uint64_t bytes = co_await bytes_on_storage_uncached();
auto remove_fut = _storage->wipe(*this, sync);

try {
@@ -3066,6 +3091,7 @@ sstable::unlink(storage::sync_dir sync) noexcept {

co_await std::move(remove_fut);
_stats.on_delete();
+ _storage->debit_bytes(bytes);
_manager.on_unlink(this);
}

Benny Halevy <bhalevy@scylladb.com>
Sep 22, 2024, 3:04:57 AM
to Laszlo Ersek, scylladb-dev@googlegroups.com, Raphael S. Carvalho
LGTM.

Can be sent independently.

Benny

Benny Halevy <bhalevy@scylladb.com>
Sep 22, 2024, 3:13:39 AM
to Laszlo Ersek, scylladb-dev@googlegroups.com, Raphael S. Carvalho
On Fri, 2024-09-20 at 19:03 +0200, 'Laszlo Ersek' via ScyllaDB development wrote:
> Upon scylla startup, I have identified the following two sstable method
> call chains:
>
> - discovering an existent sstable (potentially deleting it in the end):
>
>   sstable ctor -> load -> close_files -> [unlink]
>
> - creating a new sstable (potentially deleting it in the end):
>
>   sstable ctor -> open_sstable -> seal_sstable -> close_files -> [unlink]
>
> On the sstable "read path", credit the storage in load() and debit it in
> unlink(). On the "write path", credit the storage in seal_sstable() and
> debit it in unlink().
>
> Define the current storage usage of an sstable on this shard as:
>
> - zero, if the sstable is shared, or unshared but foreign,
>
> - as the sum of its recognized components' sizes, otherwise.
>
> In the latter case, "_recognized_components" is populated
>
> - on the read path
>
>   - in the 1st load variant via: load -> load_metadata -> read_toc,
>
>   - in the 2nd load variant via: load -> read_toc,
>
> - on the write path via: open_sstable -> generate_toc.
>
> I can't use the existent bytes_on_disk() function for this purpose.
> Namely, since at least commit 2dbae856f84e ("sstable: Piggyback on sstable
> parser and writer to provide bytes_on_disk", 2023-04-27), bytes_on_disk()
> insists on "_data_file_size" and "_index_file_size" having been set (to
> nonzero values). However, at least "_data_file_size" may not be set when
> we call bytes_on_disk() from seal_sstable(), for example.

It looks like tracking the per-storage-type stats per sstable
would be much simpler than trying to maintain a global, thread-local tally.
Looking forward to tiering, some components of an sstable may be stored
on the local file system (those that contain metadata) and some may be stored
on object storage (the data component, and maybe the index too).
So the sstable object is in the best position to tell how much data
it stores on each storage type.
Those stats can be summarized into the table stats very much like
live_disk_space_used and total_disk_space_used which are updated
in table::update_stats_for_new_sstable / rebuild_statistics /
subtract_compaction_group_from_stats.
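
For illustration, a minimal sketch of what I have in mind; the names
(storage_kind, storage_usage, bytes_on_storage) are hypothetical, not
existing scylla code:

#include <cstddef>
#include <cstdint>

enum class storage_kind { filesystem, s3, count };

// Per-sstable tally, maintained by the sstable itself as its components
// are written out or stat()ed.
struct storage_usage {
    uint64_t bytes[static_cast<size_t>(storage_kind::count)] = {};

    uint64_t& operator[](storage_kind k) {
        return bytes[static_cast<size_t>(k)];
    }
};

// The per-shard table object would then fold these into its stats when an
// sstable is added or removed, just like live_disk_space_used /
// total_disk_space_used today, along the lines of:
//
//     void table::update_stats_for_new_sstable(const sstable& sst) {
//         for (auto k : { storage_kind::filesystem, storage_kind::s3 }) {
//             _stats.bytes_on_storage[k] += sst.usage()[k];
//         }
//     }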

Laszlo Ersek <laszlo.ersek@scylladb.com>
Sep 23, 2024, 8:07:43 AM
to Benny Halevy, scylladb-dev@googlegroups.com, Raphael S. Carvalho
My understanding was that Raphael specifically needed per-shard sums.

> Looking forward to tiering, some components of an sstable may be stored
> on the local file system (those that contain metadata) and some may be stored
> on object storage (the data component, and maybe the index too).

Ah, I didn't know about that.

That feature will uproot the sstable<->storage interfaces; the single
"_storage" member in sstable, and the "flat" open_component() member
function in storage, will no longer fit. If storage needs to be pushed
under component, then isn't it better to just postpone this new metric
until then?

> So the sstable object is in the best position to tell how much data
> it stores on each storage type.

Metadata size tracking is already understandable to me on both the read
and write paths. What seems elusive is how the index file size and the
data file size are -- or *should be* -- tracked on the write path. They
don't seem to be accounted for when a new sstable is written out, and
I've not been able to determine where exactly to do it.

Perhaps I should look at make_data_or_index_sink() and "mx/writer.cc".
It seems like a good chunk of the write path is in "mx/writer.cc", but I
don't understand the separation.

> Those stats can be summarized into the table stats very much like
> live_disk_space_used and total_disk_space_used which are updated
> in table::update_stats_for_new_sstable / rebuild_statistics /
> subtract_compaction_group_from_stats.

Those stats don't seem to be per-shard (which I understood was what
Raphael was after).

Laszlo

Laszlo Ersek <laszlo.ersek@scylladb.com>
Sep 25, 2024, 8:46:02 AM
to Benny Halevy, Raphael S. Carvalho, scylladb-dev@googlegroups.com
On 9/22/24 09:13, Benny Halevy wrote:
Thank you, Benny, for your explanation above. I first found it hard to
understand, but that was because (it turns out) my mental image of
sharding of tables was completely off.

I had failed to put together

https://docs.google.com/presentation/d/e/2PACX-1vQp4_YA3OgkKuiM6pvwB0DCJnTyq6d7pXoZBP1j90bamWIemwiLQPeyC66b_y60xW1poUI-p7PuDLnU/embed?pli=1#slide=id.g2d2684e7848_0_171

with table::update_stats_for_new_sstable() in "replica/table.cc".

At <http://localhost:9180/metrics> I realized today that the
"scylla_column_family_live_disk_space" metric, from
table::set_metrics(), was already sharded (entries carried shard
labels).

Turns out a "table" object is constructed for each
"keyspace.column_family" *on each shard* already. This means that
"_stats.live_disk_space_used" is already per-shard -- which implies

sst->bytes_on_disk()

must be per-shard, too.

It explains why I needed the workaround

if (_shards.size() != 1 || _shards[0] != this_shard_id()) {
co_return bytes;
}

in sstable::bytes_on_storage_uncached() [sstables/sstables.cc].

With this, it seems that nothing needs to be done for
<https://github.com/scylladb/scylladb/issues/18191>. For the storage
usage on shard N, sum the metrics selected by

scylla_column_family_total_disk_space{cf=*,ks=*,shard=N}
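
Assuming a standard Prometheus setup scraping this endpoint, a PromQL
query along these lines should produce the same per-shard totals (the
exact label names depend on the scrape configuration):

sum by (shard) (scylla_column_family_total_disk_space)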

... In retrospect I don't understand why #18191 was opened at all (on
April 4th, 2024), given that the current logic seems to have been in
place since at least commit 2dbae856f84e ("sstable: Piggyback on sstable
parser and writer to provide bytes_on_disk", 2023-04-27).

Here's what I'm going to do:

- resubmit the middle patch in isolation, without reference to any
issue#

- abandon the rest of the patch set

- close #18191 with a comment about summing
"scylla_column_family_total_disk_space".

In theory, we could subdivide the currently existing stats per storage
type, but (a) that was never in the original scope of #18191, (b)
sstable::bytes_on_disk() will have to be reworked anyway, once tiering
comes along (i.e., when storage type becomes a function of sstable
component).

Laszlo

Benny Halevy <bhalevy@scylladb.com>
Sep 26, 2024, 9:53:33 AM
to Laszlo Ersek, Kefu Chai, Raphael S. Carvalho, Pavel Emelyanov, scylladb-dev@googlegroups.com
If we don't need it urgently, we can wait until we complete the design
phase of TWCS-based tiering and have at least a high-level design of
how to manage tiered SSTables.

It is likely we'll need to break the current model of storage, so as to
still have the object storage configuration per keyspace but track the
status of each component, or class of components. Say they could be
categorized into "metadata" (which includes the TOC, summary, filter,
statistics, scylla metadata, etc.), "index", and "data"; the sstable
could then have a tiering level that determines which components are on
object storage and which are local (with the local ones very likely on
object storage as well, for simple snapshotting).
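
To make this concrete, a hypothetical sketch only; none of these names
exist in the tree:

enum class component_class {
    metadata, // TOC, summary, filter, statistics, scylla metadata, etc.
    index,
    data,
};

enum class tiering_level {
    all_local,                // every component class on the local FS
    data_on_object,           // "data" on object storage, the rest local
    index_and_data_on_object, // only "metadata" remains local
};

// Which storage a component class lives on, at a given tiering level.
inline bool on_object_storage(tiering_level level, component_class c) {
    switch (level) {
    case tiering_level::all_local:
        return false;
    case tiering_level::data_on_object:
        return c == component_class::data;
    case tiering_level::index_and_data_on_object:
        return c != component_class::metadata;
    }
    return false; // unreachable
}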


> > So the sstable object is in the best position to tell how much data
> > it stores on each storage type.
>
> Metadata size tracking is already understandable to me on both the read
> and write paths. What seems elusive is how the index file size and the
> data file size are -- or *should be* -- tracked on the write path. They
> don't seem to be accounted for when a new sstable is written out, and
> I've not been able to determine where exactly to do it.

When the sstable is sealed we should probably have all components written and closed
so we can query their capacity/size.

> Perhaps I should look at make_data_or_index_sink() and "mx/writer.cc".
> It seems like a good chunk of the write path is in "mx/writer.cc", but I
> don't understand the separation.
>
> > Those stats can be summarized into the table stats very much like
> > live_disk_space_used and total_disk_space_used which are updated
> > in table::update_stats_for_new_sstable / rebuild_statistics /
> > subtract_compaction_group_from_stats.
>
> Those stats don't seem to be per-shard (which I understood was what
> Raphael was after).

They are if I'm not mistaken. The table is sharded, so the table object instance
is per shard. It sums only the sstables that shard owns.

>
> Laszlo
