[RFC PATCH 01/34] fs: prepare fs/ directory and conditional compilation

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:18 AM

to seastar-dev@googlegroups.com, Piotr Sarna, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

From: Piotr Sarna <sa...@scylladb.com>

This patch provides the initial infrastructure for future
SeastarFS (Seastar filesystem) patches.
Since the project is at a very early stage and is going to require
C++17 features, it is disabled by default and can only be enabled
manually, either by configuring with --enable-experimental-fs
or by defining the CMake flag -DSeastar_EXPERIMENTAL_FS=ON.
---
configure.py | 6 ++++++
CMakeLists.txt | 12 ++++++++++++
src/fs/README.md | 10 ++++++++++
3 files changed, 28 insertions(+)
create mode 100644 src/fs/README.md

diff --git a/configure.py b/configure.py
index bbc9f908..6ec350ef 100755
--- a/configure.py
+++ b/configure.py
@@ -106,6 +106,11 @@ add_tristate(
name = 'unused-result-error',
dest = "unused_result_error",
help = 'Make [[nodiscard]] violations an error')
+add_tristate(
+ arg_parser,
+ name = 'experimental-fs',
+ dest = "experimental_fs",
+ help = 'experimental support for SeastarFS')
arg_parser.add_argument('--allocator-page-size', dest='alloc_page_size', type=int, help='override allocator page size')
arg_parser.add_argument('--without-tests', dest='exclude_tests', action='store_true', help='Do not build tests by default')
arg_parser.add_argument('--without-apps', dest='exclude_apps', action='store_true', help='Do not build applications by default')
@@ -201,6 +206,7 @@ def configure_mode(mode):
tr(args.heap_profiling, 'HEAP_PROFILING'),
tr(args.coroutines_ts, 'EXPERIMENTAL_COROUTINES_TS'),
tr(args.unused_result_error, 'UNUSED_RESULT_ERROR'),
+ tr(args.experimental_fs, 'EXPERIMENTAL_FS'),
]

ingredients_to_cook = set(args.cook)
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 39ae46aa..be4f02c8 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -281,6 +281,10 @@ set (Seastar_STACK_GUARDS
STRING
"Enable stack guards. Can be ON, OFF or DEFAULT (which enables it for non release builds)")

+option (Seastar_EXPERIMENTAL_FS
+ "Compile experimental SeastarFS sources (requires C++17 support)"
+ OFF)
+
# When Seastar is embedded with `add_subdirectory`, disable the non-library targets.
if (NOT (CMAKE_CURRENT_SOURCE_DIR STREQUAL CMAKE_SOURCE_DIR))
set (Seastar_APPS OFF)
@@ -648,6 +652,14 @@ add_library (seastar STATIC

add_library (Seastar::seastar ALIAS seastar)

+if (Seastar_EXPERIMENTAL_FS)
+ message(STATUS "Experimental SeastarFS is enabled")
+ target_sources(seastar
+ PRIVATE
+ # SeastarFS source files
+ )
+endif()
+
add_dependencies (seastar
seastar_http_request_parser
seastar_http_response_parser
diff --git a/src/fs/README.md b/src/fs/README.md
new file mode 100644
index 00000000..630f34a8
--- /dev/null
+++ b/src/fs/README.md
@@ -0,0 +1,10 @@
+### SeastarFS ###
+
+SeastarFS is an R&D project aimed at providing a fully asynchronous,
+log-structured, shard-friendly file system optimized for large files
+and with native Seastar support.
+
+Source files residing in this directory will be compiled only
+by setting an appropriate flag.
+ninja: ./configure.py --enable-experimental-fs
+CMake: -DSeastar\_EXPERIMENTAL\_FS=ON
--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:18 AM

to seastar-dev@googlegroups.com, Krzysztof Małysa, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

github: https://github.com/psarna/seastar/commits/fs-metadata-log

This series is part of the ZPP FS project that is coordinated by Piotr Sarna <sa...@scylladb.com>.
The goal of this project is to create SeastarFS -- a fully asynchronous, sharded, user-space,
log-structured file system that is intended to become an alternative to XFS for Scylla.

The filesystem is optimized for:
- NVMe SSD storage
- large files
- appending files

For efficiency, all metadata is kept in memory. The metadata records where each
part of a file is located and describes the directory tree.

The whole filesystem is divided into filesystem shards (typically as many as there are Seastar shards,
for efficiency). Each shard holds its own part of the filesystem. Sharding works by giving each shard
a set of root subdirectories that it exclusively owns (one of the shards also owns the
root directory itself).
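
To make the ownership rule above concrete, here is a minimal standalone sketch.
Sharding itself is not part of this series, so everything here is an assumption
made for illustration: the shard_map type, the owner_of function, and the idea
that ownership is looked up in a plain map keyed by the first path component.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical types for illustration; none of these names come from the series.
using shard_id = unsigned;

struct shard_map {
    shard_id root_owner;                                       // shard that owns "/" itself
    std::map<std::string, shard_id, std::less<>> subdir_owner; // root subdirectory -> owner
};

// Returns the shard owning an absolute path, decided only by its first
// component. Unknown root subdirectories and relative paths yield nullopt.
inline std::optional<shard_id> owner_of(const shard_map& m, std::string_view path) {
    if (path.empty() || path.front() != '/') {
        return std::nullopt;
    }
    path.remove_prefix(1);
    std::string_view first = path.substr(0, path.find('/'));
    if (first.empty()) {
        return m.root_owner; // the path is "/" (or starts with "//")
    }
    auto it = m.subdir_owner.find(first);
    if (it == m.subdir_owner.end()) {
        return std::nullopt;
    }
    return it->second;
}
```

With a map that assigns "var" to shard 1, owner_of(m, "/var/log/x") yields 1, and
owner_of(m, "/") yields whichever shard was chosen as root_owner.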

Every shard will have 3 private logs:
- metadata log -- holds metadata and very small writes
- medium data log -- holds medium-sized writes, which can combine data from different files
- big data log -- holds data clusters, each of which belongs to a single file (this is not
actually a log, but in the big picture it looks like one)
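
As a rough illustration of how a write might be routed to one of the three logs,
the sketch below picks a log from the write size alone. The write_limits struct,
the choose_log function and both thresholds are hypothetical; the series does not
state the actual limits, and a real write may be split across logs.

```cpp
#include <cassert>
#include <cstdint>

enum class log_kind { metadata, medium_data, big_data };

// Both limits are assumptions made for this sketch, not values from the series.
struct write_limits {
    std::uint64_t small_write_max; // largest write stored inline in the metadata log
    std::uint64_t cluster_size;    // size of one cluster in bytes
};

// Picks the log for a single write of `len` bytes. Simplification: a real
// write could be split so that only whole clusters go to the big data log.
constexpr log_kind choose_log(std::uint64_t len, const write_limits& limits) {
    if (len <= limits.small_write_max) {
        return log_kind::metadata;
    }
    if (len < limits.cluster_size) {
        return log_kind::medium_data;
    }
    return log_kind::big_data;
}
```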

Disk space is divided into clusters (typically several MiB each) that
all have the same size, which is a multiple of the alignment (typically 4096
bytes). Each shard has its own private pool of clusters (the assignment is
stored in the bootstrap record). Each log consumes clusters one by one -- it
writes to the current one and, when that cluster becomes full, switches to
a new one obtained from the pool of free clusters managed by
cluster_allocator. The metadata log and the medium data log write data in the
same manner: they fill up the cluster gradually from left to right. The big
data log takes a cluster and completely fills it with data at once -- it
is only used during big writes.
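
The cluster handling described above can be sketched in standalone C++ as follows.
This is not the cluster_allocator from the series: simple_cluster_allocator and
simple_log are hypothetical names, and rounding every append up to the alignment
is an assumption made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>
#include <utility>

using cluster_id = std::uint64_t;

// Hypothetical free-cluster pool; hands clusters out in FIFO order.
class simple_cluster_allocator {
    std::deque<cluster_id> _free;
public:
    explicit simple_cluster_allocator(std::deque<cluster_id> free_clusters)
        : _free(std::move(free_clusters)) {}

    // Returns the next free cluster, or nullopt when the pool is exhausted.
    std::optional<cluster_id> alloc() {
        if (_free.empty()) {
            return std::nullopt;
        }
        cluster_id id = _free.front();
        _free.pop_front();
        return id;
    }
    void free(cluster_id id) { _free.push_back(id); }
    std::size_t free_count() const { return _free.size(); }
};

// Hypothetical log that fills its current cluster from left to right and
// switches to a fresh cluster when the next write no longer fits.
class simple_log {
    simple_cluster_allocator& _alloc;
    std::uint64_t _cluster_size;
    std::uint64_t _alignment;
    std::optional<cluster_id> _current;
    std::uint64_t _offset = 0; // next free byte within the current cluster
public:
    simple_log(simple_cluster_allocator& alloc, std::uint64_t cluster_size, std::uint64_t alignment)
        : _alloc(alloc), _cluster_size(cluster_size), _alignment(alignment) {
        assert(alignment > 0 && cluster_size % alignment == 0);
    }

    // Reserves room for `len` bytes (rounded up to the alignment) and returns
    // the device offset of the reservation, or nullopt if it cannot be placed.
    std::optional<std::uint64_t> append(std::uint64_t len) {
        const std::uint64_t aligned = (len + _alignment - 1) / _alignment * _alignment;
        if (aligned == 0 || aligned > _cluster_size) {
            return std::nullopt;
        }
        if (!_current || _offset + aligned > _cluster_size) {
            auto next = _alloc.alloc();
            if (!next) {
                return std::nullopt;
            }
            _current = next;
            _offset = 0;
        }
        const std::uint64_t pos = *_current * _cluster_size + _offset;
        _offset += aligned;
        return pos;
    }
};
```

With 8 KiB clusters and 4 KiB alignment, two appends fill one cluster and the
third append moves on to the next cluster taken from the pool.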

metadata_log is in fact a standalone file system instance that provides a lower-level interface
(paths and inodes) to the shard's own part of the filesystem. It manages all 3 logs mentioned above and
maintains all metadata about its part of the file system, which includes data structures for the directory
structure and file content, locking logic for safe concurrent usage, buffers for writing logs to
disk, and bootstrapping -- restoring the file system structure from disk.

This patch implements:
- bootstrap record -- our equivalent of the filesystem superblock -- it contains information like
size of the block device, number of filesystem shards, cluster distribution among shards
- cluster allocator for managing free clusters within one metadata_log
- fully functional metadata_log that will be one shard's part of the filesystem
- bootstrapping metadata_log
- creating / deleting files and directories (+ support for unlinked files)
- reading, writing and truncating files
- opening and closing files
- linking files (but not directories)
- iterating directory and getting file attributes
- tests of some components and functionality of the metadata_log and bootstrap record

What is not here yet, but we plan to push later:
- compaction
- filesystem sharding
- renaming files

Tests: unit(dev)

Aleksander Sorokin (3):
fs: add initial file implementation
tests: fs: add parallel i/o unit test for seastarfs file
tests: fs: add basic test for metadata log bootstrapping

Krzysztof Małysa (14):
fs: add initial block_device implementation
fs: add temporary_file
tests: fs: add block_device unit test
fs: add unit headers
fs: add seastar/fs/overloaded.hh
fs: add seastar/fs/path.hh with unit tests
fs: add value_shared_lock.hh
fs: metadata_log: add base implementation
fs: metadata_log: add operation for creating and opening unlinked file
fs: metadata_log: add creating files and directories
fs: metadata_log: add private operation for deleting inode
fs: metadata_log: add link operation
fs: metadata_log: add unlinking files and removing directories
fs: metadata_log: add stat() operation

Michał Niciejewski (10):
fs: add bootstrap record implementation
tests: fs: add tests for bootstrap record
fs: metadata_log: add opening file
fs: metadata_log: add closing file
fs: metadata_log: add write operation
fs: metadata_log: add read operation
tests: fs: added metadata_to_disk_buffer and cluster_writer mockers
tests: fs: add write test
fs: read: add optimization for aligned reads
tests: fs: add tests for aligned reads and writes

Piotr Sarna (1):
fs: prepare fs/ directory and conditional compilation

Wojciech Mitros (6):
fs: add cluster allocator
fs: add cluster allocator tests
fs: metadata_log: add truncate operation
tests: fs: add to_disk_buffer test
tests: fs: add truncate operation test
tests: fs: add metadata_to_disk_buffer unit tests

configure.py | 6 +
include/seastar/fs/block_device.hh | 102 +++
include/seastar/fs/exceptions.hh | 88 ++
include/seastar/fs/file.hh | 55 ++
include/seastar/fs/overloaded.hh | 26 +
include/seastar/fs/stat.hh | 41 +
include/seastar/fs/temporary_file.hh | 54 ++
src/fs/bitwise.hh | 125 +++
src/fs/bootstrap_record.hh | 98 ++
src/fs/cluster.hh | 42 +
src/fs/cluster_allocator.hh | 50 ++
src/fs/cluster_writer.hh | 85 ++
src/fs/crc.hh | 34 +
src/fs/device_reader.hh | 91 ++
src/fs/inode.hh | 80 ++
src/fs/inode_info.hh | 221 +++++
src/fs/metadata_disk_entries.hh | 208 +++++
src/fs/metadata_log.hh | 362 ++++++++
src/fs/metadata_log_bootstrap.hh | 145 +++
.../create_and_open_unlinked_file.hh | 77 ++
src/fs/metadata_log_operations/create_file.hh | 174 ++++
src/fs/metadata_log_operations/link_file.hh | 112 +++
src/fs/metadata_log_operations/read.hh | 138 +++
src/fs/metadata_log_operations/truncate.hh | 90 ++
.../unlink_or_remove_file.hh | 196 ++++
src/fs/metadata_log_operations/write.hh | 318 +++++++
src/fs/metadata_to_disk_buffer.hh | 244 +++++
src/fs/path.hh | 42 +
src/fs/range.hh | 61 ++
src/fs/to_disk_buffer.hh | 138 +++
src/fs/units.hh | 40 +
src/fs/unix_metadata.hh | 40 +
src/fs/value_shared_lock.hh | 65 ++
tests/unit/fs_metadata_common.hh | 467 ++++++++++
tests/unit/fs_mock_block_device.hh | 55 ++
tests/unit/fs_mock_cluster_writer.hh | 78 ++
tests/unit/fs_mock_metadata_to_disk_buffer.hh | 323 +++++++
src/fs/bootstrap_record.cc | 206 +++++
src/fs/cluster_allocator.cc | 54 ++
src/fs/device_reader.cc | 199 +++++
src/fs/file.cc | 108 +++
src/fs/metadata_log.cc | 525 +++++++++++
src/fs/metadata_log_bootstrap.cc | 552 ++++++++++++
tests/unit/fs_block_device_test.cc | 206 +++++
tests/unit/fs_bootstrap_record_test.cc | 414 +++++++++
tests/unit/fs_cluster_allocator_test.cc | 115 +++
tests/unit/fs_log_bootstrap_test.cc | 86 ++
tests/unit/fs_metadata_to_disk_buffer_test.cc | 462 ++++++++++
tests/unit/fs_mock_block_device.cc | 50 ++
.../fs_mock_metadata_to_disk_buffer_test.cc | 357 ++++++++
tests/unit/fs_path_test.cc | 90 ++
tests/unit/fs_seastarfs_test.cc | 62 ++
tests/unit/fs_to_disk_buffer_test.cc | 160 ++++
tests/unit/fs_truncate_test.cc | 171 ++++
tests/unit/fs_write_test.cc | 835 ++++++++++++++++++
CMakeLists.txt | 50 ++
src/fs/README.md | 10 +
tests/unit/CMakeLists.txt | 42 +
58 files changed, 9325 insertions(+)
create mode 100644 include/seastar/fs/block_device.hh
create mode 100644 include/seastar/fs/exceptions.hh
create mode 100644 include/seastar/fs/file.hh
create mode 100644 include/seastar/fs/overloaded.hh
create mode 100644 include/seastar/fs/stat.hh
create mode 100644 include/seastar/fs/temporary_file.hh
create mode 100644 src/fs/bitwise.hh
create mode 100644 src/fs/bootstrap_record.hh
create mode 100644 src/fs/cluster.hh
create mode 100644 src/fs/cluster_allocator.hh
create mode 100644 src/fs/cluster_writer.hh
create mode 100644 src/fs/crc.hh
create mode 100644 src/fs/device_reader.hh
create mode 100644 src/fs/inode.hh
create mode 100644 src/fs/inode_info.hh
create mode 100644 src/fs/metadata_disk_entries.hh
create mode 100644 src/fs/metadata_log.hh
create mode 100644 src/fs/metadata_log_bootstrap.hh
create mode 100644 src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
create mode 100644 src/fs/metadata_log_operations/create_file.hh
create mode 100644 src/fs/metadata_log_operations/link_file.hh
create mode 100644 src/fs/metadata_log_operations/read.hh
create mode 100644 src/fs/metadata_log_operations/truncate.hh
create mode 100644 src/fs/metadata_log_operations/unlink_or_remove_file.hh
create mode 100644 src/fs/metadata_log_operations/write.hh
create mode 100644 src/fs/metadata_to_disk_buffer.hh
create mode 100644 src/fs/path.hh
create mode 100644 src/fs/range.hh
create mode 100644 src/fs/to_disk_buffer.hh
create mode 100644 src/fs/units.hh
create mode 100644 src/fs/unix_metadata.hh
create mode 100644 src/fs/value_shared_lock.hh
create mode 100644 tests/unit/fs_metadata_common.hh
create mode 100644 tests/unit/fs_mock_block_device.hh
create mode 100644 tests/unit/fs_mock_cluster_writer.hh
create mode 100644 tests/unit/fs_mock_metadata_to_disk_buffer.hh
create mode 100644 src/fs/bootstrap_record.cc
create mode 100644 src/fs/cluster_allocator.cc
create mode 100644 src/fs/device_reader.cc
create mode 100644 src/fs/file.cc
create mode 100644 src/fs/metadata_log.cc
create mode 100644 src/fs/metadata_log_bootstrap.cc
create mode 100644 tests/unit/fs_block_device_test.cc
create mode 100644 tests/unit/fs_bootstrap_record_test.cc
create mode 100644 tests/unit/fs_cluster_allocator_test.cc
create mode 100644 tests/unit/fs_log_bootstrap_test.cc
create mode 100644 tests/unit/fs_metadata_to_disk_buffer_test.cc
create mode 100644 tests/unit/fs_mock_block_device.cc
create mode 100644 tests/unit/fs_mock_metadata_to_disk_buffer_test.cc
create mode 100644 tests/unit/fs_path_test.cc
create mode 100644 tests/unit/fs_seastarfs_test.cc
create mode 100644 tests/unit/fs_to_disk_buffer_test.cc
create mode 100644 tests/unit/fs_truncate_test.cc
create mode 100644 tests/unit/fs_write_test.cc
create mode 100644 src/fs/README.md

--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:19 AM

to seastar-dev@googlegroups.com, Krzysztof Małysa, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

block_device is an abstraction over an opened block device file or
an opened ordinary file of fixed size. It offers:
- opening and closing the file (block device)
- aligned reads and writes
- flushing
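
Since reads and writes must be aligned, a caller that holds an arbitrary byte
range has to widen it first. The helpers below are a minimal sketch of that
step, assuming that both position and length must be multiples of the
alignment; is_aligned, aligned_range and widen_to_alignment are hypothetical
names that are not part of this patch, and buffer address alignment is not
covered.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: an aligned [pos, pos + len) range on the device.
struct aligned_range {
    std::uint64_t pos;
    std::uint64_t len;
};

constexpr bool is_aligned(std::uint64_t value, std::uint64_t alignment) {
    return value % alignment == 0;
}

// Widens [pos, pos + len) to the smallest enclosing range whose start and
// length are both multiples of `alignment` (which must be non-zero).
constexpr aligned_range widen_to_alignment(std::uint64_t pos, std::uint64_t len, std::uint64_t alignment) {
    const std::uint64_t begin = pos / alignment * alignment;
    const std::uint64_t end = (pos + len + alignment - 1) / alignment * alignment;
    return aligned_range{begin, end - begin};
}
```

The caller would then read the widened range into an aligned buffer and copy
out only the bytes it originally asked for.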

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
include/seastar/fs/block_device.hh | 102 +++++++++++++++++++++++++++++
CMakeLists.txt | 1 +
2 files changed, 103 insertions(+)
create mode 100644 include/seastar/fs/block_device.hh

diff --git a/include/seastar/fs/block_device.hh b/include/seastar/fs/block_device.hh
new file mode 100644
index 00000000..31822617
--- /dev/null
+++ b/include/seastar/fs/block_device.hh
@@ -0,0 +1,102 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include "seastar/core/file.hh"
+#include "seastar/core/reactor.hh"
+
+namespace seastar::fs {
+
+class block_device_impl {
+public:
+ virtual ~block_device_impl() = default;
+
+ virtual future<size_t> write(uint64_t aligned_pos, const void* aligned_buffer, size_t aligned_len, const io_priority_class& pc) = 0;
+ virtual future<size_t> read(uint64_t aligned_pos, void* aligned_buffer, size_t aligned_len, const io_priority_class& pc) = 0;
+ virtual future<> flush() = 0;
+ virtual future<> close() = 0;
+};
+
+class block_device {
+ shared_ptr<block_device_impl> _block_device_impl;
+public:
+ block_device(shared_ptr<block_device_impl> impl) noexcept : _block_device_impl(std::move(impl)) {}
+
+ block_device() = default;
+
+ block_device(const block_device&) = default;
+ block_device(block_device&&) noexcept = default;
+ block_device& operator=(const block_device&) noexcept = default;
+ block_device& operator=(block_device&&) noexcept = default;
+
+ explicit operator bool() const noexcept { return bool(_block_device_impl); }
+
+ template <typename CharType>
+ future<size_t> read(uint64_t aligned_offset, CharType* aligned_buffer, size_t aligned_len, const io_priority_class& pc = default_priority_class()) {
+ return _block_device_impl->read(aligned_offset, aligned_buffer, aligned_len, pc);
+ }
+
+ template <typename CharType>
+ future<size_t> write(uint64_t aligned_offset, const CharType* aligned_buffer, size_t aligned_len, const io_priority_class& pc = default_priority_class()) {
+ return _block_device_impl->write(aligned_offset, aligned_buffer, aligned_len, pc);
+ }
+
+ future<> flush() {
+ return _block_device_impl->flush();
+ }
+
+ future<> close() {
+ return _block_device_impl->close();
+ }
+};
+
+class file_block_device_impl : public block_device_impl {
+ file _file;
+public:
+ explicit file_block_device_impl(file f) : _file(std::move(f)) {}
+
+ ~file_block_device_impl() override = default;
+
+ future<size_t> write(uint64_t aligned_pos, const void* aligned_buffer, size_t aligned_len, const io_priority_class& pc) override {
+ return _file.dma_write(aligned_pos, aligned_buffer, aligned_len, pc);
+ }
+
+ future<size_t> read(uint64_t aligned_pos, void* aligned_buffer, size_t aligned_len, const io_priority_class& pc) override {
+ return _file.dma_read(aligned_pos, aligned_buffer, aligned_len, pc);
+ }
+
+ future<> flush() override {
+ return _file.flush();
+ }
+
+ future<> close() override {
+ return _file.close();
+ }
+};
+
+inline future<block_device> open_block_device(std::string name) {
+ return open_file_dma(std::move(name), open_flags::rw).then([](file f) {
+ return block_device(make_shared<file_block_device_impl>(std::move(f)));
+ });
+}
+
+}
diff --git a/CMakeLists.txt b/CMakeLists.txt
index be4f02c8..b50abf99 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -657,6 +657,7 @@ if (Seastar_EXPERIMENTAL_FS)
target_sources(seastar
PRIVATE
# SeastarFS source files
+ include/seastar/fs/block_device.hh
)
endif()

--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:21 AM

to seastar-dev@googlegroups.com, Krzysztof Małysa, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

temporary_file is a handle to a temporary file with a path.
It creates the temporary file upon construction and removes it upon
destruction.
The main use case is testing the file system on a temporary file that
simulates a block device.

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
include/seastar/fs/temporary_file.hh | 54 ++++++++++++++++++++++++++++
CMakeLists.txt | 1 +
2 files changed, 55 insertions(+)
create mode 100644 include/seastar/fs/temporary_file.hh

diff --git a/include/seastar/fs/temporary_file.hh b/include/seastar/fs/temporary_file.hh
new file mode 100644
index 00000000..c00282d9
--- /dev/null
+++ b/include/seastar/fs/temporary_file.hh
@@ -0,0 +1,54 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*

+ */
+
+#pragma once
+
+#include "seastar/core/posix.hh"
+
+#include <string>
+
+namespace seastar::fs {
+
+class temporary_file {
+ std::string _path;
+
+public:
+ explicit temporary_file(std::string path) : _path(std::move(path) + ".XXXXXX") {
+ int fd = mkstemp(_path.data());
+ throw_system_error_on(fd == -1);
+ close(fd);
+ }
+
+ ~temporary_file() {
+ unlink(_path.data());
+ }
+
+ temporary_file(const temporary_file&) = delete;
+ temporary_file& operator=(const temporary_file&) = delete;
+ temporary_file(temporary_file&&) noexcept = delete;
+ temporary_file& operator=(temporary_file&&) noexcept = delete;
+
+ const std::string& path() const noexcept {
+ return _path;
+ }
+};
+
+} // namespace seastar::fs
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 0ba7ee35..39d11ad8 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -659,6 +659,7 @@ if (Seastar_EXPERIMENTAL_FS)
# SeastarFS source files
include/seastar/fs/block_device.hh
include/seastar/fs/file.hh
+ include/seastar/fs/temporary_file.hh
src/fs/file.cc
)
endif()
--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:21 AM

to seastar-dev@googlegroups.com, Aleksander Sorokin, sarna@scylladb.com, quport@gmail.com, wmitros@protonmail.com

From: Aleksander Sorokin <ank...@gmail.com>

Currently the only implementation of Seastar's file abstraction is
`posix_file_impl`. This patch provides another implementation, which keeps
a reference to our file system's metadata in it and uses the `block_device`
handle underneath. It implements `seastarfs_file_impl`, which derives from
`file_impl` and provides a stub interface. At the moment it's extremely
oversimplified and just treats the whole block device as one huge file.
Along with it, the patch provides a free function for creating this handle.

Signed-off-by: Aleksander Sorokin <ank...@gmail.com>
---
include/seastar/fs/file.hh | 55 +++++++++++++++++++
src/fs/file.cc | 108 +++++++++++++++++++++++++++++++++++++
CMakeLists.txt | 2 +
3 files changed, 165 insertions(+)
create mode 100644 include/seastar/fs/file.hh
create mode 100644 src/fs/file.cc

diff --git a/include/seastar/fs/file.hh b/include/seastar/fs/file.hh
new file mode 100644
index 00000000..ae38f3a4
--- /dev/null
+++ b/include/seastar/fs/file.hh
@@ -0,0 +1,55 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*

+ */
+
+#pragma once
+
+#include "seastar/core/file.hh"
+#include "seastar/core/future.hh"
+#include "seastar/fs/block_device.hh"
+
+namespace seastar::fs {
+
+class seastarfs_file_impl : public file_impl {
+ block_device _block_device;
+ open_flags _open_flags;
+public:
+ seastarfs_file_impl(block_device dev, open_flags flags);
+ ~seastarfs_file_impl() override = default;
+
+ future<size_t> write_dma(uint64_t pos, const void* buffer, size_t len, const io_priority_class& pc) override;
+ future<size_t> write_dma(uint64_t pos, std::vector<iovec> iov, const io_priority_class& pc) override;
+ future<size_t> read_dma(uint64_t pos, void* buffer, size_t len, const io_priority_class& pc) override;
+ future<size_t> read_dma(uint64_t pos, std::vector<iovec> iov, const io_priority_class& pc) override;
+ future<> flush() override;
+ future<struct stat> stat() override;
+ future<> truncate(uint64_t length) override;
+ future<> discard(uint64_t offset, uint64_t length) override;
+ future<> allocate(uint64_t position, uint64_t length) override;
+ future<uint64_t> size() override;
+ future<> close() noexcept override;
+ std::unique_ptr<file_handle_impl> dup() override;
+ subscription<directory_entry> list_directory(std::function<future<> (directory_entry de)> next) override;
+ future<temporary_buffer<uint8_t>> dma_read_bulk(uint64_t offset, size_t range_size, const io_priority_class& pc) override;
+};
+
+future<file> open_file_dma(std::string name, open_flags flags);
+
+}
diff --git a/src/fs/file.cc b/src/fs/file.cc
new file mode 100644
index 00000000..4f4e0ac6
--- /dev/null
+++ b/src/fs/file.cc
@@ -0,0 +1,108 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#include "seastar/core/future.hh"
+#include "seastar/fs/block_device.hh"
+#include "seastar/fs/file.hh"
+
+namespace seastar::fs {
+
+seastarfs_file_impl::seastarfs_file_impl(block_device dev, open_flags flags)
+ : _block_device(std::move(dev))
+ , _open_flags(flags) {}
+
+future<size_t>
+seastarfs_file_impl::write_dma(uint64_t pos, const void* buffer, size_t len, const io_priority_class& pc) {
+ return _block_device.write(pos, buffer, len, pc);
+}
+
+future<size_t>
+seastarfs_file_impl::write_dma(uint64_t pos, std::vector<iovec> iov, const io_priority_class& pc) {
+ throw std::bad_function_call();
+}
+
+future<size_t>
+seastarfs_file_impl::read_dma(uint64_t pos, void* buffer, size_t len, const io_priority_class& pc) {
+ return _block_device.read(pos, buffer, len, pc);
+}
+
+future<size_t>
+seastarfs_file_impl::read_dma(uint64_t pos, std::vector<iovec> iov, const io_priority_class& pc) {
+ throw std::bad_function_call();
+}
+
+future<>
+seastarfs_file_impl::flush() {
+ return _block_device.flush();
+}
+
+future<struct stat>
+seastarfs_file_impl::stat() {
+ throw std::bad_function_call();
+}
+
+future<>
+seastarfs_file_impl::truncate(uint64_t) {
+ throw std::bad_function_call();
+}
+
+future<>
+seastarfs_file_impl::discard(uint64_t offset, uint64_t length) {
+ throw std::bad_function_call();
+}
+
+future<>
+seastarfs_file_impl::allocate(uint64_t position, uint64_t length) {
+ throw std::bad_function_call();
+}
+
+future<uint64_t>
+seastarfs_file_impl::size() {
+ throw std::bad_function_call();
+}
+
+future<>
+seastarfs_file_impl::close() noexcept {
+ return _block_device.close();
+}
+
+std::unique_ptr<file_handle_impl>
+seastarfs_file_impl::dup() {
+ throw std::bad_function_call();
+}
+
+subscription<directory_entry>
+seastarfs_file_impl::list_directory(std::function<future<> (directory_entry de)> next) {
+ throw std::bad_function_call();
+}
+
+future<temporary_buffer<uint8_t>>
+seastarfs_file_impl::dma_read_bulk(uint64_t offset, size_t range_size, const io_priority_class& pc) {
+ throw std::bad_function_call();
+}
+
+future<file> open_file_dma(std::string name, open_flags flags) {
+ return open_block_device(std::move(name)).then([flags] (block_device bd) {
+ return file(make_shared<seastarfs_file_impl>(std::move(bd), flags));
+ });
+}
+
+}
diff --git a/CMakeLists.txt b/CMakeLists.txt
index b50abf99..0ba7ee35 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -658,6 +658,8 @@ if (Seastar_EXPERIMENTAL_FS)
PRIVATE
# SeastarFS source files
include/seastar/fs/block_device.hh
+ include/seastar/fs/file.hh
+ src/fs/file.cc
)
endif()

--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:22 AM

to seastar-dev@googlegroups.com, Aleksander Sorokin, sarna@scylladb.com, quport@gmail.com, wmitros@protonmail.com

From: Aleksander Sorokin <ank...@gmail.com>

Add a first crude unit test for seastarfs_file_impl:
parallel writing with a newly created handle and reading the same data back.

Signed-off-by: Aleksander Sorokin <ank...@gmail.com>
---
tests/unit/fs_seastarfs_test.cc | 62 +++++++++++++++++++++++++++++++++
tests/unit/CMakeLists.txt | 5 +++
2 files changed, 67 insertions(+)
create mode 100644 tests/unit/fs_seastarfs_test.cc

diff --git a/tests/unit/fs_seastarfs_test.cc b/tests/unit/fs_seastarfs_test.cc
new file mode 100644
index 00000000..25c3e8d5
--- /dev/null
+++ b/tests/unit/fs_seastarfs_test.cc
@@ -0,0 +1,62 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#include "seastar/core/aligned_buffer.hh"
+#include "seastar/core/file-types.hh"
+#include "seastar/core/file.hh"
+#include "seastar/core/thread.hh"
+#include "seastar/core/units.hh"
+#include "seastar/fs/file.hh"
+#include "seastar/fs/temporary_file.hh"
+#include "seastar/testing/thread_test_case.hh"
+
+using namespace seastar;
+using namespace fs;
+
+constexpr auto device_path = "/tmp/seastarfs";
+constexpr auto device_size = 16 * MB;
+
+SEASTAR_THREAD_TEST_CASE(parallel_read_write_test) {
+ const auto tf = temporary_file(device_path);
+ auto f = fs::open_file_dma(tf.path(), open_flags::rw).get0();
+ static auto alignment = f.memory_dma_alignment();
+
+ parallel_for_each(boost::irange<off_t>(0, device_size / alignment), [&f](auto i) {
+ auto wbuf = allocate_aligned_buffer<unsigned char>(alignment, alignment);
+ std::fill(wbuf.get(), wbuf.get() + alignment, i);
+ auto wb = wbuf.get();
+
+ return f.dma_write(i * alignment, wb, alignment).then(
+ [&f, i, wbuf = std::move(wbuf)](auto ret) mutable {
+ BOOST_REQUIRE_EQUAL(ret, alignment);
+ auto rbuf = allocate_aligned_buffer<unsigned char>(alignment, alignment);
+ auto rb = rbuf.get();
+ return f.dma_read(i * alignment, rb, alignment).then(
+ [f, rbuf = std::move(rbuf), wbuf = std::move(wbuf)](auto ret) {
+ BOOST_REQUIRE_EQUAL(ret, alignment);
+ BOOST_REQUIRE(std::equal(rbuf.get(), rbuf.get() + alignment, wbuf.get()));
+ });
+ });
+ }).wait();
+
+ f.flush().wait();
+ f.close().wait();
+}
diff --git a/tests/unit/CMakeLists.txt b/tests/unit/CMakeLists.txt
index 8f203721..f2c5187f 100644
--- a/tests/unit/CMakeLists.txt
+++ b/tests/unit/CMakeLists.txt
@@ -361,6 +361,11 @@ seastar_add_test (rpc
loopback_socket.hh
rpc_test.cc)

+if (Seastar_EXPERIMENTAL_FS)
+ seastar_add_test (fs_seastarfs
+ SOURCES fs_seastarfs_test.cc)
+endif()
+
seastar_add_test (semaphore
SOURCES semaphore_test.cc)

--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:23 AM

to seastar-dev@googlegroups.com, Krzysztof Małysa, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

What is tested:
- simple reads and writes
- parallel non-overlapping writes, then parallel non-overlapping reads
- random and simultaneous reads and writes

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
tests/unit/fs_block_device_test.cc | 206 +++++++++++++++++++++++++++++
tests/unit/CMakeLists.txt | 3 +
2 files changed, 209 insertions(+)
create mode 100644 tests/unit/fs_block_device_test.cc

diff --git a/tests/unit/fs_block_device_test.cc b/tests/unit/fs_block_device_test.cc
new file mode 100644
index 00000000..6887005c
--- /dev/null
+++ b/tests/unit/fs_block_device_test.cc
@@ -0,0 +1,206 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+

+#include "seastar/core/do_with.hh"
+#include "seastar/core/future-util.hh"
+#include "seastar/core/temporary_buffer.hh"
+
+#include <boost/range/irange.hpp>
+#include <random>
+#include <seastar/core/app-template.hh>
+#include <seastar/core/thread.hh>
+#include <seastar/core/units.hh>
+#include <seastar/fs/block_device.hh>
+#include <seastar/fs/temporary_file.hh>
+#include <seastar/testing/test_runner.hh>
+#include <unistd.h>
+
+using namespace seastar;
+using namespace seastar::fs;
+
+constexpr off_t min_device_size = 16*MB;
+constexpr size_t alignment = 4*KB;
+
+static future<temporary_buffer<char>> allocate_random_aligned_buffer(size_t size) {
+ return do_with(temporary_buffer<char>::aligned(alignment, size),
+ std::default_random_engine(testing::local_random_engine()), [size](auto& buffer, auto& random_engine) {
+ return do_for_each(buffer.get_write(), buffer.get_write() + size, [&](char& c) {
+ std::uniform_int_distribution<> character(0, (1 << (sizeof(char) * 8)) - 1);
+ c = character(random_engine);
+ }).then([&buffer] {
+ return std::move(buffer);
+ });
+ });
+}

+
+static future<> test_basic_read_write(const std::string& device_path) {
+ return async([&] {
+ block_device dev = open_block_device(device_path).get0();
+ constexpr size_t buff_size = 16*KB;
+ auto buffer = allocate_random_aligned_buffer(buff_size).get0();
+ auto check_buffer = allocate_random_aligned_buffer(buff_size).get0();
+
+ // Write and read
+ assert(dev.write(0, buffer.get(), buff_size).get0() == buff_size);
+ assert(dev.read(0, check_buffer.get_write(), buff_size).get0() == buff_size);
+ assert(std::memcmp(buffer.get(), check_buffer.get(), buff_size) == 0);
+
+ // Data have to remain after closing
+ dev.close().get0();
+ dev = open_block_device(device_path).get0();
+ check_buffer = allocate_random_aligned_buffer(buff_size).get0(); // Re-randomize so the compare below proves the read filled the buffer
+ assert(dev.read(0, check_buffer.get_write(), buff_size).get0() == buff_size);
+ assert(std::memcmp(buffer.get(), check_buffer.get(), buff_size) == 0);
+
+ dev.close().get0();
+ });
+}
+
+static future<> test_parallel_read_write(const std::string& device_path) {
+ return async([&] {
+ block_device dev = open_block_device(device_path).get0();
+ constexpr size_t buff_size = 16*MB;
+ auto buffer = allocate_random_aligned_buffer(buff_size).get0();
+
+ // Write
+ static_assert(buff_size % alignment == 0);
+ parallel_for_each(boost::irange<off_t>(0, buff_size / alignment), [&](off_t block_no) {
+ off_t offset = block_no * alignment;
+ return dev.write(offset, buffer.get() + offset, alignment).then([](size_t written) {
+ assert(written == alignment);
+ });
+ }).get0();
+
+ // Read
+ static_assert(buff_size % alignment == 0);
+ parallel_for_each(boost::irange<off_t>(0, buff_size / alignment), [&](off_t block_no) {
+ return async([&dev, &buffer, block_no] {
+ off_t offset = block_no * alignment;
+ auto check_buffer = allocate_random_aligned_buffer(alignment).get0();
+ assert(dev.read(offset, check_buffer.get_write(), alignment).get0() == alignment);
+ assert(std::memcmp(buffer.get() + offset, check_buffer.get(), alignment) == 0);
+ });
+ }).get0();
+
+ dev.close().get0();
+ });
+}
+
+static future<> test_simultaneous_parallel_read_and_write(const std::string& device_path) {
+ return async([&] {
+ block_device dev = open_block_device(device_path).get0();
+ constexpr size_t buff_size = 16*MB;
+ auto buffer = allocate_random_aligned_buffer(buff_size).get0();
+ assert(dev.write(0, buffer.get(), buff_size).get0() == buff_size);
+
+ static_assert(buff_size % alignment == 0);
+ size_t blocks_num = buff_size / alignment;
+ enum Kind { WRITE, READ };
+ std::vector<Kind> block_kind(blocks_num);
+ std::default_random_engine random_engine(testing::local_random_engine());
+ std::uniform_int_distribution<> choose_write(0, 1);
+ for (Kind& kind : block_kind) {
+ kind = (choose_write(random_engine) ? WRITE : READ);
+ }
+
+ // Perform simultaneous reads and writes
+ auto new_buffer = allocate_random_aligned_buffer(buff_size).get0();
+ auto write_fut = parallel_for_each(boost::irange<off_t>(0, blocks_num), [&](off_t block_no) {
+ if (block_kind[block_no] != WRITE) {
+ return now();
+ }
+
+ off_t offset = block_no * alignment;
+ return dev.write(offset, new_buffer.get() + offset, alignment).then([](size_t written) {
+ assert(written == alignment);
+ });
+ });
+ auto read_fut = parallel_for_each(boost::irange<off_t>(0, blocks_num), [&](off_t block_no) {
+ if (block_kind[block_no] != READ) {
+ return now();
+ }
+
+ return async([&dev, &buffer, block_no] {
+ off_t offset = block_no * alignment;
+ auto check_buffer = allocate_random_aligned_buffer(alignment).get0();
+ assert(dev.read(offset, check_buffer.get_write(), alignment).get0() == alignment);
+ assert(std::memcmp(buffer.get() + offset, check_buffer.get(), alignment) == 0);
+ });
+ });
+
+ when_all_succeed(std::move(write_fut), std::move(read_fut)).get0();
+
+ // Check that writes were made in the correct places
+ parallel_for_each(boost::irange<off_t>(0, blocks_num), [&](off_t block_no) {
+ return async([&dev, &buffer, &new_buffer, &block_kind, block_no] {
+ off_t offset = block_no * alignment;
+ auto check_buffer = allocate_random_aligned_buffer(alignment).get0();
+ assert(dev.read(offset, check_buffer.get_write(), alignment).get0() == alignment);
+ auto& orig_buff = (block_kind[block_no] == WRITE ? new_buffer : buffer);
+ assert(std::memcmp(orig_buff.get() + offset, check_buffer.get(), alignment) == 0);
+ });
+ }).get0();
+
+ dev.close().get0();
+ });
+}
+
+static future<> prepare_file(const std::string& file_path) {
+ return async([&] {
+ // Create device file if it does not exist
+ file dev = open_file_dma(file_path, open_flags::rw | open_flags::create).get0();
+
+ auto st = dev.stat().get0();
+ if (S_ISREG(st.st_mode) and st.st_size < min_device_size) {
+ dev.truncate(min_device_size).get0();
+ }
+
+ dev.close().get0();
+ });
+}
+
+int main(int argc, char** argv) {
+ app_template app;
+ app.add_options()
+ ("help", "produce this help message")
+ ("dev", boost::program_options::value<std::string>(),
+ "optional path to device file to test block_device on");
+ return app.run(argc, argv, [&app] {
+ return async([&] {
+ auto& args = app.configuration();
+ std::optional<temporary_file> tmp_device_file;
+ std::string device_path = [&]() -> std::string {
+ if (args.count("dev")) {
+ return args["dev"].as<std::string>();
+ }
+
+ tmp_device_file.emplace("/tmp/block_device_test_file");
+ return tmp_device_file->path();
+ }();
+
+ assert(not device_path.empty());
+ prepare_file(device_path).get0();
+ test_basic_read_write(device_path).get0();
+ test_parallel_read_write(device_path).get0();
+ test_simultaneous_parallel_read_and_write(device_path).get0();
+ });
+ });
+}

diff --git a/tests/unit/CMakeLists.txt b/tests/unit/CMakeLists.txt
index f2c5187f..21e564fb 100644
--- a/tests/unit/CMakeLists.txt
+++ b/tests/unit/CMakeLists.txt
@@ -362,6 +362,9 @@ seastar_add_test (rpc
rpc_test.cc)

if (Seastar_EXPERIMENTAL_FS)
+ seastar_add_app_test (fs_block_device
+ SOURCES fs_block_device_test.cc
+ LIBRARIES seastar_testing)
seastar_add_test (fs_seastarfs
SOURCES fs_seastarfs_test.cc)
endif()
--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:24 AM

to seastar-dev@googlegroups.com, Krzysztof Małysa, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

- units.hh: basic units
- cluster.hh: cluster_id unit and operations on it (converting cluster
ids to offsets)
- inode.hh: inode unit and operations on it (extracting shard_no from
inode and allocating new inode)
- bitwise.hh: bitwise operations
- range.hh: range abstraction

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
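Note for reviewers: a minimal standalone sketch of how the cluster helpers are
meant to behave, assuming a hypothetical cluster size of 1 MiB. The aliases and
functions are re-declared here so the snippet compiles on its own; it relies on
the GCC/Clang builtin __builtin_ctzll, as bitwise.hh does.

```cpp
#include <cassert>
#include <cstdint>

// Standalone stand-ins for the aliases from units.hh and cluster.hh.
using unit_size_t = uint32_t;
using disk_offset_t = uint64_t;
using cluster_id_t = uint64_t;

constexpr bool is_power_of_2(uint64_t x) noexcept {
    return x > 0 && (x & (x - 1)) == 0;
}

// offset / cluster_size, computed as a right shift by log2(cluster_size).
constexpr cluster_id_t offset_to_cluster_id(disk_offset_t offset, unit_size_t cluster_size) noexcept {
    assert(is_power_of_2(cluster_size));
    return offset >> __builtin_ctzll(cluster_size);
}

// cluster_id * cluster_size, computed as a left shift by log2(cluster_size).
constexpr disk_offset_t cluster_id_to_offset(cluster_id_t cluster_id, unit_size_t cluster_size) noexcept {
    assert(is_power_of_2(cluster_size));
    return cluster_id << __builtin_ctzll(cluster_size);
}

// Hypothetical cluster size, chosen for illustration only.
constexpr unit_size_t cluster_size = 1u << 20; // 1 MiB

// Every offset inside a cluster maps to that cluster's id ...
static_assert(offset_to_cluster_id(0, cluster_size) == 0);
static_assert(offset_to_cluster_id((1u << 20) - 1, cluster_size) == 0);
static_assert(offset_to_cluster_id(5 * (1u << 20) + 123, cluster_size) == 5);
// ... and a cluster id maps back to the offset of the cluster's first byte.
static_assert(cluster_id_to_offset(5, cluster_size) == 5 * (1u << 20));
```

The conversion is lossy in one direction only: offset -> id -> offset rounds
the offset down to the start of its cluster.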

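Note for reviewers: a small sketch of the inode numbering scheme from inode.hh,
assuming a pool of 4 shards. Each shard starts at its own id and strides by the
pool size, so the low bits of an inode always name the owning shard. The helper
first_inodes is hypothetical and exists only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using inode_t = uint64_t;
using fs_shard_id_t = uint32_t;

// The low log2(pool_size) bits of an inode identify the owning shard.
// pool_size is assumed to be a power of 2.
inline fs_shard_id_t inode_to_shard_no(inode_t inode, fs_shard_id_t pool_size) noexcept {
    return static_cast<fs_shard_id_t>(inode & (pool_size - 1));
}

// Hypothetical helper: the first `count` inodes a shard would hand out,
// i.e. shard_id, shard_id + pool_size, shard_id + 2 * pool_size, ...
inline std::vector<inode_t> first_inodes(fs_shard_id_t shard_id, fs_shard_id_t pool_size, std::size_t count) {
    std::vector<inode_t> out;
    inode_t next = shard_id;
    for (std::size_t i = 0; i < count; ++i) {
        out.push_back(next);
        next += pool_size;
    }
    return out;
}
```

With a pool of 4 shards, shard 3 would allocate 3, 7, 11, and so on, and
inode_to_shard_no maps each of those back to 3.
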
src/fs/bitwise.hh | 125 ++++++++++++++++++++++++++++++++++++++++++++++
src/fs/cluster.hh | 42 ++++++++++++++++
src/fs/inode.hh | 80 +++++++++++++++++++++++++++++
src/fs/range.hh | 61 ++++++++++++++++++++++
src/fs/units.hh | 40 +++++++++++++++
CMakeLists.txt | 5 ++
6 files changed, 353 insertions(+)
create mode 100644 src/fs/bitwise.hh
create mode 100644 src/fs/cluster.hh
create mode 100644 src/fs/inode.hh
create mode 100644 src/fs/range.hh
create mode 100644 src/fs/units.hh

diff --git a/src/fs/bitwise.hh b/src/fs/bitwise.hh
new file mode 100644
index 00000000..e53c1919
--- /dev/null
+++ b/src/fs/bitwise.hh

@@ -0,0 +1,125 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+

+#pragma once
+
+#include <cassert>
+#include <type_traits>

+
+namespace seastar::fs {
+

+template<class T, std::enable_if_t<std::is_unsigned_v<T>, int> = 0>
+constexpr inline bool is_power_of_2(T x) noexcept {
+ return (x > 0 and (x & (x - 1)) == 0);
+}
+
+static_assert(not is_power_of_2(0u));
+static_assert(is_power_of_2(1u));
+static_assert(is_power_of_2(2u));
+static_assert(not is_power_of_2(3u));
+static_assert(is_power_of_2(4u));
+static_assert(not is_power_of_2(5u));
+static_assert(not is_power_of_2(6u));
+static_assert(not is_power_of_2(7u));
+static_assert(is_power_of_2(8u));
+
+template<class T, class U, std::enable_if_t<std::is_unsigned_v<T>, int> = 0, std::enable_if_t<std::is_unsigned_v<U>, int> = 0>
+constexpr inline T div_by_power_of_2(T a, U b) noexcept {
+ assert(is_power_of_2(b));
+ return (a >> __builtin_ctzll(b)); // should be 2 CPU cycles after inlining on modern x86_64
+}
+
+static_assert(div_by_power_of_2(13u, 1u) == 13);
+static_assert(div_by_power_of_2(12u, 4u) == 3);
+static_assert(div_by_power_of_2(42u, 32u) == 1);
+
+template<class T, class U, std::enable_if_t<std::is_unsigned_v<T>, int> = 0, std::enable_if_t<std::is_unsigned_v<U>, int> = 0>
+constexpr inline T mod_by_power_of_2(T a, U b) noexcept {
+ assert(is_power_of_2(b));
+ return (a & (b - 1));
+}
+
+static_assert(mod_by_power_of_2(13u, 1u) == 0);
+static_assert(mod_by_power_of_2(42u, 32u) == 10);
+
+template<class T, class U, std::enable_if_t<std::is_unsigned_v<T>, int> = 0, std::enable_if_t<std::is_unsigned_v<U>, int> = 0>
+constexpr inline T mul_by_power_of_2(T a, U b) noexcept {
+ assert(is_power_of_2(b));
+ return (a << __builtin_ctzll(b)); // should be 2 CPU cycles after inlining on modern x86_64
+}
+
+static_assert(mul_by_power_of_2(3u, 1u) == 3);
+static_assert(mul_by_power_of_2(3u, 4u) == 12);
+
+template<class T, class U, std::enable_if_t<std::is_unsigned_v<T>, int> = 0, std::enable_if_t<std::is_unsigned_v<U>, int> = 0>
+constexpr inline T round_down_to_multiple_of_power_of_2(T a, U b) noexcept {
+ return a - mod_by_power_of_2(a, b);
+}
+
+static_assert(round_down_to_multiple_of_power_of_2(0u, 1u) == 0);
+static_assert(round_down_to_multiple_of_power_of_2(1u, 1u) == 1);
+static_assert(round_down_to_multiple_of_power_of_2(19u, 1u) == 19);
+
+static_assert(round_down_to_multiple_of_power_of_2(0u, 2u) == 0);
+static_assert(round_down_to_multiple_of_power_of_2(1u, 2u) == 0);
+static_assert(round_down_to_multiple_of_power_of_2(2u, 2u) == 2);
+static_assert(round_down_to_multiple_of_power_of_2(3u, 2u) == 2);
+static_assert(round_down_to_multiple_of_power_of_2(4u, 2u) == 4);
+static_assert(round_down_to_multiple_of_power_of_2(5u, 2u) == 4);
+
+static_assert(round_down_to_multiple_of_power_of_2(31u, 16u) == 16);
+static_assert(round_down_to_multiple_of_power_of_2(32u, 16u) == 32);
+static_assert(round_down_to_multiple_of_power_of_2(33u, 16u) == 32);
+static_assert(round_down_to_multiple_of_power_of_2(37u, 16u) == 32);
+static_assert(round_down_to_multiple_of_power_of_2(39u, 16u) == 32);
+static_assert(round_down_to_multiple_of_power_of_2(45u, 16u) == 32);
+static_assert(round_down_to_multiple_of_power_of_2(47u, 16u) == 32);
+static_assert(round_down_to_multiple_of_power_of_2(48u, 16u) == 48);
+static_assert(round_down_to_multiple_of_power_of_2(49u, 16u) == 48);
+
+template<class T, class U, std::enable_if_t<std::is_unsigned_v<T>, int> = 0, std::enable_if_t<std::is_unsigned_v<U>, int> = 0>
+constexpr inline T round_up_to_multiple_of_power_of_2(T a, U b) noexcept {
+ auto mod = mod_by_power_of_2(a, b);
+ return (mod == 0 ? a : a - mod + b);
+}
+
+static_assert(round_up_to_multiple_of_power_of_2(0u, 1u) == 0);
+static_assert(round_up_to_multiple_of_power_of_2(1u, 1u) == 1);
+static_assert(round_up_to_multiple_of_power_of_2(19u, 1u) == 19);
+
+static_assert(round_up_to_multiple_of_power_of_2(0u, 2u) == 0);
+static_assert(round_up_to_multiple_of_power_of_2(1u, 2u) == 2);
+static_assert(round_up_to_multiple_of_power_of_2(2u, 2u) == 2);
+static_assert(round_up_to_multiple_of_power_of_2(3u, 2u) == 4);
+static_assert(round_up_to_multiple_of_power_of_2(4u, 2u) == 4);
+static_assert(round_up_to_multiple_of_power_of_2(5u, 2u) == 6);
+
+static_assert(round_up_to_multiple_of_power_of_2(31u, 16u) == 32);
+static_assert(round_up_to_multiple_of_power_of_2(32u, 16u) == 32);
+static_assert(round_up_to_multiple_of_power_of_2(33u, 16u) == 48);
+static_assert(round_up_to_multiple_of_power_of_2(37u, 16u) == 48);
+static_assert(round_up_to_multiple_of_power_of_2(39u, 16u) == 48);
+static_assert(round_up_to_multiple_of_power_of_2(45u, 16u) == 48);
+static_assert(round_up_to_multiple_of_power_of_2(47u, 16u) == 48);
+static_assert(round_up_to_multiple_of_power_of_2(48u, 16u) == 48);
+static_assert(round_up_to_multiple_of_power_of_2(49u, 16u) == 64);

+
+} // namespace seastar::fs

diff --git a/src/fs/cluster.hh b/src/fs/cluster.hh
new file mode 100644
index 00000000..a35ce323
--- /dev/null
+++ b/src/fs/cluster.hh
@@ -0,0 +1,42 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+

+#pragma once
+
+#include "fs/bitwise.hh"
+#include "fs/units.hh"

+
+namespace seastar::fs {
+

+using cluster_id_t = uint64_t;
+using cluster_range = range<cluster_id_t>;
+
+inline cluster_id_t offset_to_cluster_id(disk_offset_t offset, unit_size_t cluster_size) noexcept {
+ assert(is_power_of_2(cluster_size));
+ return div_by_power_of_2(offset, cluster_size);
+}
+
+inline disk_offset_t cluster_id_to_offset(cluster_id_t cluster_id, unit_size_t cluster_size) noexcept {
+ assert(is_power_of_2(cluster_size));
+ return mul_by_power_of_2(cluster_id, cluster_size);
+}
+
+} // namespace seastar::fs

diff --git a/src/fs/inode.hh b/src/fs/inode.hh
new file mode 100644
index 00000000..aabc8d00
--- /dev/null
+++ b/src/fs/inode.hh
@@ -0,0 +1,80 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+

+#pragma once
+
+#include "fs/bitwise.hh"
+#include "fs/units.hh"
+
+#include <cstdint>
+#include <optional>

+
+namespace seastar::fs {
+

+// The last log2(fs_shards_pool_size) bits of the inode number contain the id of the shard that owns the inode
+using inode_t = uint64_t;
+
+// Obtains shard id of the shard owning @p inode.
+// @p fs_shards_pool_size is the number of file system shards rounded up to a power of 2
+inline fs_shard_id_t inode_to_shard_no(inode_t inode, fs_shard_id_t fs_shards_pool_size) noexcept {
+ assert(is_power_of_2(fs_shards_pool_size));
+ return mod_by_power_of_2(inode, fs_shards_pool_size);
+}
+
+// Returns inode belonging to the shard owning @p shard_previous_inode that is next after @p shard_previous_inode
+// (i.e. the lowest inode greater than @p shard_previous_inode belonging to the same shard)
+// @p fs_shards_pool_size is the number of file system shards rounded up to a power of 2
+inline inode_t shard_next_inode(inode_t shard_previous_inode, fs_shard_id_t fs_shards_pool_size) noexcept {
+ return shard_previous_inode + fs_shards_pool_size;
+}
+
+// Returns first inode (lowest by value) belonging to the shard @p fs_shard_id
+inline inode_t shard_first_inode(fs_shard_id_t fs_shard_id) noexcept {
+ return fs_shard_id;
+}
+
+class shard_inode_allocator {
+ fs_shard_id_t _fs_shards_pool_size;
+ fs_shard_id_t _fs_shard_id;
+ std::optional<inode_t> _latest_allocated_inode;
+
+public:
+ shard_inode_allocator(fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id, std::optional<inode_t> latest_allocated_inode = std::nullopt)
+ : _fs_shards_pool_size(fs_shards_pool_size)
+ , _fs_shard_id(fs_shard_id)
+ , _latest_allocated_inode(latest_allocated_inode) {}
+
+ inode_t alloc() noexcept {
+ if (not _latest_allocated_inode) {
+ _latest_allocated_inode = shard_first_inode(_fs_shard_id);
+ return *_latest_allocated_inode;
+ }
+
+ _latest_allocated_inode = shard_next_inode(*_latest_allocated_inode, _fs_shards_pool_size);
+ return *_latest_allocated_inode;
+ }
+
+ void reset(std::optional<inode_t> latest_allocated_inode = std::nullopt) noexcept {
+ _latest_allocated_inode = latest_allocated_inode;
+ }
+};
+
+} // namespace seastar::fs

diff --git a/src/fs/range.hh b/src/fs/range.hh
new file mode 100644
index 00000000..ef0c6756
--- /dev/null
+++ b/src/fs/range.hh
@@ -0,0 +1,61 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+

+#pragma once
+
+#include <algorithm>

+
+namespace seastar::fs {
+

+template <class T>
+struct range {
+ T beg;
+ T end; // exclusive
+
+ constexpr bool is_empty() const noexcept { return beg >= end; }
+
+ constexpr T size() const noexcept { return end - beg; }
+};
+
+template <class T>
+range(T beg, T end) -> range<T>;
+
+template <class T>
+inline bool operator==(range<T> a, range<T> b) noexcept {
+ return (a.beg == b.beg and a.end == b.end);
+}
+
+template <class T>
+inline bool operator!=(range<T> a, range<T> b) noexcept {
+ return not (a == b);
+}
+
+template <class T>
+inline range<T> intersection(range<T> a, range<T> b) noexcept {
+ return {std::max(a.beg, b.beg), std::min(a.end, b.end)};
+}
+
+template <class T>
+inline bool are_intersecting(range<T> a, range<T> b) noexcept {
+ return (std::max(a.beg, b.beg) < std::min(a.end, b.end));
+}
+
+} // namespace seastar::fs

diff --git a/src/fs/units.hh b/src/fs/units.hh
new file mode 100644
index 00000000..1fc6754b
--- /dev/null
+++ b/src/fs/units.hh
@@ -0,0 +1,40 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+

+#pragma once
+
+#include "fs/range.hh"
+
+#include <cstdint>

+
+namespace seastar::fs {
+

+using unit_size_t = uint32_t;
+
+using disk_offset_t = uint64_t;
+using disk_range = range<disk_offset_t>;
+
+using file_offset_t = uint64_t;
+using file_range = range<file_offset_t>;
+
+using fs_shard_id_t = uint32_t;

+
+} // namespace seastar::fs

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 39d11ad8..8ad08c7a 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -660,7 +660,12 @@ if (Seastar_EXPERIMENTAL_FS)
include/seastar/fs/block_device.hh
include/seastar/fs/file.hh
include/seastar/fs/temporary_file.hh
+ src/fs/bitwise.hh
+ src/fs/cluster.hh
src/fs/file.cc
+ src/fs/inode.hh
+ src/fs/range.hh
+ src/fs/units.hh
)
endif()

--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:25 AM

to seastar-dev@googlegroups.com, Wojciech Mitros, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com

From: Wojciech Mitros <wmi...@protonmail.com>

Disk space is divided into segments of set size, called clusters. Each shard of
the filesystem will be assigned a set of clusters. Cluster allocator is the tool
that enables allocating and freeing clusters from that set.

Signed-off-by: Wojciech Mitros <wmi...@protonmail.com>
---
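Note for reviewers: a minimal standalone sketch of the intended behaviour,
mirroring the class in this patch without the Seastar headers. Free clusters
are handed out in FIFO order, and a freed cluster goes to the back of the
queue, so it is reused only after all clusters that were already free.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <optional>
#include <unordered_set>
#include <utility>

using cluster_id_t = uint64_t;

// Standalone copy of the allocator logic, for illustration only.
class cluster_allocator {
    std::unordered_set<cluster_id_t> _allocated_clusters;
    std::deque<cluster_id_t> _free_clusters;

public:
    cluster_allocator(std::unordered_set<cluster_id_t> allocated_clusters, std::deque<cluster_id_t> free_clusters)
        : _allocated_clusters(std::move(allocated_clusters)), _free_clusters(std::move(free_clusters)) {}

    // Takes the cluster at the front of the free queue, if there is one.
    std::optional<cluster_id_t> alloc() {
        if (_free_clusters.empty()) {
            return std::nullopt;
        }
        cluster_id_t cluster_id = _free_clusters.front();
        _free_clusters.pop_front();
        _allocated_clusters.insert(cluster_id);
        return cluster_id;
    }

    // Returns an allocated cluster to the back of the free queue.
    void free(cluster_id_t cluster_id) {
        assert(_allocated_clusters.count(cluster_id) == 1);
        _allocated_clusters.erase(cluster_id);
        _free_clusters.push_back(cluster_id);
    }
};
```

For example, starting with free clusters {7, 9}, two alloc() calls return 7 and
then 9, a third returns std::nullopt, and after free(7) the next alloc()
returns 7 again.
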
src/fs/cluster_allocator.hh | 50 ++++++++++++++++++++++++++++++++++
src/fs/cluster_allocator.cc | 54 +++++++++++++++++++++++++++++++++++++
CMakeLists.txt | 2 ++
3 files changed, 106 insertions(+)
create mode 100644 src/fs/cluster_allocator.hh
create mode 100644 src/fs/cluster_allocator.cc

diff --git a/src/fs/cluster_allocator.hh b/src/fs/cluster_allocator.hh
new file mode 100644
index 00000000..ef4f30b9
--- /dev/null
+++ b/src/fs/cluster_allocator.hh
@@ -0,0 +1,50 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+

+#include "fs/cluster.hh"
+
+#include <deque>
+#include <optional>
+#include <unordered_set>
+
+namespace seastar {
+
+namespace fs {
+
+class cluster_allocator {
+ std::unordered_set<cluster_id_t> _allocated_clusters;
+ std::deque<cluster_id_t> _free_clusters;
+
+public:
+ explicit cluster_allocator(std::unordered_set<cluster_id_t> allocated_clusters, std::deque<cluster_id_t> free_clusters);
+
+ // Tries to allocate a cluster
+ std::optional<cluster_id_t> alloc();
+
+ // @p cluster_id has to be allocated using alloc()
+ void free(cluster_id_t cluster_id);
+};
+
+}
+
+}
diff --git a/src/fs/cluster_allocator.cc b/src/fs/cluster_allocator.cc
new file mode 100644
index 00000000..c436c7ba
--- /dev/null
+++ b/src/fs/cluster_allocator.cc
@@ -0,0 +1,54 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+

+#include "fs/cluster.hh"
+#include "fs/cluster_allocator.hh"
+
+#include <cassert>
+#include <optional>
+
+namespace seastar {
+
+namespace fs {
+
+cluster_allocator::cluster_allocator(std::unordered_set<cluster_id_t> allocated_clusters, std::deque<cluster_id_t> free_clusters)
+ : _allocated_clusters(std::move(allocated_clusters)), _free_clusters(std::move(free_clusters)) {}
+
+std::optional<cluster_id_t> cluster_allocator::alloc() {
+ if (_free_clusters.empty()) {
+ return std::nullopt;
+ }
+
+ cluster_id_t cluster_id = _free_clusters.front();
+ _free_clusters.pop_front();
+ _allocated_clusters.insert(cluster_id);
+ return cluster_id;
+}
+
+void cluster_allocator::free(cluster_id_t cluster_id) {
+ assert(_allocated_clusters.count(cluster_id) == 1);
+ _free_clusters.emplace_back(cluster_id);
+ _allocated_clusters.erase(cluster_id);
+}
+
+}
+
+}
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 8ad08c7a..891201a3 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -662,6 +662,8 @@ if (Seastar_EXPERIMENTAL_FS)
include/seastar/fs/temporary_file.hh
src/fs/bitwise.hh
src/fs/cluster.hh
+ src/fs/cluster_allocator.cc
+ src/fs/cluster_allocator.hh
src/fs/file.cc
src/fs/inode.hh
src/fs/range.hh
--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:26 AM

to seastar-dev@googlegroups.com, Wojciech Mitros, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com

From: Wojciech Mitros <wmi...@protonmail.com>

Added tests checking whether the cluster allocator works correctly in ordinary
and corner (e.g. trying to alloc with no free clusters) cases.

Signed-off-by: Wojciech Mitros <wmi...@protonmail.com>
---

tests/unit/fs_cluster_allocator_test.cc | 115 ++++++++++++++++++++++++
tests/unit/CMakeLists.txt | 3 +
2 files changed, 118 insertions(+)
create mode 100644 tests/unit/fs_cluster_allocator_test.cc

diff --git a/tests/unit/fs_cluster_allocator_test.cc b/tests/unit/fs_cluster_allocator_test.cc
new file mode 100644
index 00000000..3650254e
--- /dev/null
+++ b/tests/unit/fs_cluster_allocator_test.cc
@@ -0,0 +1,115 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+

+#define BOOST_TEST_MODULE fs
+
+#include "fs/cluster_allocator.hh"
+
+#include <boost/test/included/unit_test.hpp>
+#include <deque>
+#include <seastar/core/units.hh>
+#include <unordered_set>
+
+using namespace seastar;
+
+BOOST_AUTO_TEST_CASE(test_cluster_0) {
+ fs::cluster_allocator ca({}, {0});
+ BOOST_REQUIRE_EQUAL(ca.alloc().value(), 0);
+ BOOST_REQUIRE(ca.alloc() == std::nullopt);
+ BOOST_REQUIRE(ca.alloc() == std::nullopt);
+ ca.free(0);
+ BOOST_REQUIRE_EQUAL(ca.alloc().value(), 0);
+ BOOST_REQUIRE(ca.alloc() == std::nullopt);
+ BOOST_REQUIRE(ca.alloc() == std::nullopt);
+}
+
+BOOST_AUTO_TEST_CASE(test_empty) {
+ fs::cluster_allocator empty_ca{{}, {}};
+ BOOST_REQUIRE(empty_ca.alloc() == std::nullopt);
+}
+
+BOOST_AUTO_TEST_CASE(test_small) {
+ std::deque<fs::cluster_id_t> deq{1, 5, 3, 4, 2};
+ fs::cluster_allocator small_ca({}, deq);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[0]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[1]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[2]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[3]);
+
+ small_ca.free(deq[2]);
+ small_ca.free(deq[1]);
+ small_ca.free(deq[3]);
+ small_ca.free(deq[0]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[4]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[2]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[1]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[3]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[0]);
+ BOOST_REQUIRE(small_ca.alloc() == std::nullopt);
+
+ small_ca.free(deq[2]);
+ small_ca.free(deq[4]);
+ small_ca.free(deq[3]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[2]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[4]);
+ small_ca.free(deq[2]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[3]);
+ small_ca.free(deq[4]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[2]);
+}
+
+BOOST_AUTO_TEST_CASE(test_max) {
+ constexpr fs::cluster_id_t clusters_per_shard = 1024;
+ std::deque<fs::cluster_id_t> deq;
+ for (fs::cluster_id_t i = 0; i < clusters_per_shard; i++) {
+ deq.emplace_back(i);
+ }
+ fs::cluster_allocator ordinary_ca({}, deq);
+ for (fs::cluster_id_t i = 0; i < clusters_per_shard; i++) {
+ BOOST_REQUIRE_EQUAL(ordinary_ca.alloc().value(), i);
+ }
+ BOOST_REQUIRE(ordinary_ca.alloc() == std::nullopt);
+ for (fs::cluster_id_t i = 0; i < clusters_per_shard; i++) {
+ ordinary_ca.free(i);
+ }
+}
+
+BOOST_AUTO_TEST_CASE(test_pseudo_rand) {
+ std::unordered_set<fs::cluster_id_t> uset;
+ std::deque<fs::cluster_id_t> deq;
+ fs::cluster_id_t elem = 215;
+ while (elem != 806) {
+ deq.emplace_back(elem);
+ elem = (elem * 215) % 1021;
+ }
+ elem = 1;
+ while (elem != 1020) {
+ uset.insert(elem);
+ elem = (elem * 19) % 1021;
+ }
+ fs::cluster_allocator random_ca(uset, deq);
+ elem = 215;
+ while (elem != 1) {
+ BOOST_REQUIRE_EQUAL(random_ca.alloc().value(), elem);
+ random_ca.free(1021-elem);
+ elem = (elem * 215) % 1021;
+ }
+}
diff --git a/tests/unit/CMakeLists.txt b/tests/unit/CMakeLists.txt
index 21e564fb..b2669e0a 100644
--- a/tests/unit/CMakeLists.txt
+++ b/tests/unit/CMakeLists.txt
@@ -365,6 +365,9 @@ if (Seastar_EXPERIMENTAL_FS)
seastar_add_app_test (fs_block_device
SOURCES fs_block_device_test.cc
LIBRARIES seastar_testing)
+ seastar_add_test (fs_cluster_allocator
+ KIND BOOST
+ SOURCES fs_cluster_allocator_test.cc)

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:27 AM

to seastar-dev@googlegroups.com, Michał Niciejewski, sarna@scylladb.com, ankezy@gmail.com, wmitros@protonmail.com

From: Michał Niciejewski <qup...@gmail.com>

Corner case tests:
- simple valid tests for reading and writing the bootstrap record
- valid and invalid number of shards (the range
  [1, bootstrap_record::max_shards_nb] is valid)
- invalid CRC in the read record
- invalid magic number in the read record
- invalid information about filesystem shards:
  * id of the first metadata log cluster not in the available cluster range
  * invalid cluster range
  * overlapping available cluster ranges for two different shards
  * invalid alignment
  * invalid cluster size
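
The invalid-read cases above share one pattern: write a valid record, corrupt a single field in the raw buffer, then recompute the CRC so that only the intended check rejects the record. A minimal standalone sketch of that pattern follows. It does not use the types from this series: toy_record_disk, crc32_sketch, serialize, crc_ok and repair_crc are hypothetical names, and the bitwise CRC-32 is only a stand-in for the boost::crc_32_type wrapper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Bitwise CRC-32 (reflected polynomial 0xEDB88320). A stand-in for the
// boost::crc_32_type wrapper in fs/crc.hh, so the sketch needs no Boost.
inline uint32_t crc32_sketch(const void* data, size_t len) noexcept {
    const auto* bytes = static_cast<const uint8_t*>(data);
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= bytes[i];
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right; XOR in the polynomial when the dropped bit was set.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Hypothetical miniature on-disk record. The CRC is the last field and
// covers every byte before it, like bootstrap_record_disk in this series.
struct toy_record_disk {
    uint64_t magic;
    uint32_t shards_nb;
    uint32_t crc;
};

constexpr size_t toy_crc_pos = offsetof(toy_record_disk, crc);

// Serializes a record into a byte buffer with a valid trailing CRC.
inline std::vector<uint8_t> serialize(uint64_t magic, uint32_t shards_nb) {
    toy_record_disk rec{};
    rec.magic = magic;
    rec.shards_nb = shards_nb;
    rec.crc = crc32_sketch(&rec, toy_crc_pos);
    std::vector<uint8_t> buf(sizeof(rec));
    std::memcpy(buf.data(), &rec, sizeof(rec));
    return buf;
}

// True when the stored CRC matches the CRC of the preceding bytes.
inline bool crc_ok(const std::vector<uint8_t>& buf) noexcept {
    uint32_t stored = 0;
    std::memcpy(&stored, buf.data() + toy_crc_pos, sizeof(stored));
    return stored == crc32_sketch(buf.data(), toy_crc_pos);
}

// Recomputes the CRC in place, in the spirit of repair_crc32() in the tests.
inline void repair_crc(std::vector<uint8_t>& buf) noexcept {
    const uint32_t crc_new = crc32_sketch(buf.data(), toy_crc_pos);
    std::memcpy(buf.data() + toy_crc_pos, &crc_new, sizeof(crc_new));
}
```

Flipping a byte makes crc_ok() fail. Calling repair_crc() afterwards makes it pass again, so a later semantic check (magic number, shard count) becomes the one that rejects the record.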

Signed-off-by: Michał Niciejewski <qup...@gmail.com>
---
tests/unit/fs_mock_block_device.hh | 55 ++++
tests/unit/fs_bootstrap_record_test.cc | 414 +++++++++++++++++++++++++
tests/unit/fs_mock_block_device.cc | 50 +++
tests/unit/CMakeLists.txt | 4 +
4 files changed, 523 insertions(+)
create mode 100644 tests/unit/fs_mock_block_device.hh
create mode 100644 tests/unit/fs_bootstrap_record_test.cc
create mode 100644 tests/unit/fs_mock_block_device.cc

diff --git a/tests/unit/fs_mock_block_device.hh b/tests/unit/fs_mock_block_device.hh
new file mode 100644
index 00000000..08da1491
--- /dev/null
+++ b/tests/unit/fs_mock_block_device.hh
@@ -0,0 +1,55 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*

+ */
+
+#pragma once
+
+#include <cstring>
+#include <seastar/fs/block_device.hh>
+
+namespace seastar::fs {
+
+class mock_block_device_impl : public block_device_impl {
+public:
+ using buf_type = basic_sstring<uint8_t, size_t, 32, false>;
+ buf_type buf;
+ ~mock_block_device_impl() override = default;
+
+ struct write_operation {
+ uint64_t disk_offset;
+ temporary_buffer<uint8_t> data;
+ };
+
+ std::vector<write_operation> writes;
+
+ future<size_t> write(uint64_t pos, const void* buffer, size_t len, const io_priority_class&) override;
+
+ future<size_t> read(uint64_t pos, void* buffer, size_t len, const io_priority_class&) noexcept override;
+
+ future<> flush() noexcept override {
+ return make_ready_future<>();
+ }
+
+ future<> close() noexcept override {
+ return make_ready_future<>();
+ }
+};
+
+} // seastar::fs
diff --git a/tests/unit/fs_bootstrap_record_test.cc b/tests/unit/fs_bootstrap_record_test.cc
new file mode 100644
index 00000000..9994f5ee
--- /dev/null
+++ b/tests/unit/fs_bootstrap_record_test.cc
@@ -0,0 +1,414 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#include "fs/bootstrap_record.hh"
+#include "fs/cluster.hh"
+#include "fs/crc.hh"
+#include "fs_mock_block_device.hh"
+
+#include <boost/crc.hpp>
+#include <cstring>
+#include <seastar/core/print.hh>
+#include <seastar/fs/block_device.hh>
+#include <seastar/testing/test_case.hh>
+#include <seastar/testing/test_runner.hh>
+#include <seastar/testing/thread_test_case.hh>
+
+using namespace seastar;
+using namespace seastar::fs;
+
+namespace {
+
+inline std::vector<bootstrap_record::shard_info> prepare_valid_shards_info(uint32_t size) {
+ std::vector<bootstrap_record::shard_info> ret(size);
+ cluster_id_t curr = 1;
+ for (bootstrap_record::shard_info& info : ret) {
+ info.available_clusters = {curr, curr + 1};
+ info.metadata_cluster = curr;
+ curr++;
+ }
+ return ret;
+}
+
+inline void repair_crc32(shared_ptr<mock_block_device_impl> dev_impl) noexcept {
+ mock_block_device_impl::buf_type& buff = dev_impl.get()->buf;
+ constexpr size_t crc_pos = offsetof(bootstrap_record_disk, crc);
+ const uint32_t crc_new = crc32(buff.data(), crc_pos);
+ std::memcpy(buff.data() + crc_pos, &crc_new, sizeof(crc_new));
+}
+
+inline void change_byte_at_offset(shared_ptr<mock_block_device_impl> dev_impl, size_t offset) noexcept {
+ dev_impl.get()->buf[offset] ^= 1;
+}
+
+template<typename T>
+inline void place_at_offset(shared_ptr<mock_block_device_impl> dev_impl, size_t offset, T value) noexcept {
+ std::memcpy(dev_impl.get()->buf.data() + offset, &value, sizeof(value));
+}
+
+template<>
+inline void place_at_offset(shared_ptr<mock_block_device_impl> dev_impl, size_t offset,
+ std::vector<bootstrap_record::shard_info> shards_info) noexcept {
+ bootstrap_record::shard_info shards_info_disk[bootstrap_record::max_shards_nb];
+ std::memset(shards_info_disk, 0, sizeof(shards_info_disk));
+ std::copy(shards_info.begin(), shards_info.end(), shards_info_disk);
+
+ std::memcpy(dev_impl.get()->buf.data() + offset, shards_info_disk, sizeof(shards_info_disk));
+}
+
+inline bool check_exception_message(const invalid_bootstrap_record& ex, const sstring& message) {
+ return sstring(ex.what()).find(message) != sstring::npos;
+}
+
+const bootstrap_record default_write_record(1, bootstrap_record::min_alignment * 4,
+ bootstrap_record::min_alignment * 8, 1, {{6, {6, 9}}, {9, {9, 12}}, {12, {12, 15}}});
+
+}
+
+
+
+BOOST_TEST_DONT_PRINT_LOG_VALUE(bootstrap_record)
+
+SEASTAR_THREAD_TEST_CASE(valid_basic_test) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ const bootstrap_record write_record = default_write_record;
+
+ write_record.write_to_disk(dev).get();
+ const bootstrap_record read_record = bootstrap_record::read_from_disk(dev).get0();
+ BOOST_REQUIRE_EQUAL(write_record, read_record);
+}
+
+SEASTAR_THREAD_TEST_CASE(valid_max_shards_nb_test) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ bootstrap_record write_record = default_write_record;
+ write_record.shards_info = prepare_valid_shards_info(bootstrap_record::max_shards_nb);
+
+ write_record.write_to_disk(dev).get();
+ const bootstrap_record read_record = bootstrap_record::read_from_disk(dev).get0();
+ BOOST_REQUIRE_EQUAL(write_record, read_record);
+}
+
+SEASTAR_THREAD_TEST_CASE(valid_one_shard_test) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ bootstrap_record write_record = default_write_record;
+ write_record.shards_info = prepare_valid_shards_info(1);
+
+ write_record.write_to_disk(dev).get();
+ const bootstrap_record read_record = bootstrap_record::read_from_disk(dev).get0();
+ BOOST_REQUIRE_EQUAL(write_record, read_record);
+}
+
+
+
+SEASTAR_THREAD_TEST_CASE(invalid_crc_read) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ const bootstrap_record write_record = default_write_record;
+
+ constexpr size_t crc_offset = offsetof(bootstrap_record_disk, crc);
+
+ write_record.write_to_disk(dev).get();
+ change_byte_at_offset(dev_impl, crc_offset);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Invalid CRC");
+ });
+}
+
+SEASTAR_THREAD_TEST_CASE(invalid_magic_read) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ const bootstrap_record write_record = default_write_record;
+
+ constexpr size_t magic_offset = offsetof(bootstrap_record_disk, magic);
+
+ write_record.write_to_disk(dev).get();
+ change_byte_at_offset(dev_impl, magic_offset);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Invalid magic number");
+ });
+}
+
+SEASTAR_THREAD_TEST_CASE(invalid_shards_info_read) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ const bootstrap_record write_record = default_write_record;
+
+ constexpr size_t shards_nb_offset = offsetof(bootstrap_record_disk, shards_nb);
+ constexpr size_t shards_info_offset = offsetof(bootstrap_record_disk, shards_info);
+
+ // shards_nb > max_shards_nb
+ write_record.write_to_disk(dev).get();
+ place_at_offset(dev_impl, shards_nb_offset, bootstrap_record::max_shards_nb + 1);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, fmt::format("Shards number should be smaller or equal to {}", bootstrap_record::max_shards_nb));
+ });
+
+ // shards_nb == 0
+ write_record.write_to_disk(dev).get();
+ place_at_offset(dev_impl, shards_nb_offset, 0);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Shards number should be greater than 0");
+ });
+
+ std::vector<bootstrap_record::shard_info> shards_info;
+
+ // metadata_cluster not in available_clusters range
+ write_record.write_to_disk(dev).get();
+ shards_info = {{1, {2, 3}}};
+ place_at_offset(dev_impl, shards_nb_offset, shards_info.size());
+ place_at_offset(dev_impl, shards_info_offset, shards_info);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster with metadata should be inside available cluster range");
+ });
+
+ write_record.write_to_disk(dev).get();
+ shards_info = {{3, {2, 3}}};
+ place_at_offset(dev_impl, shards_nb_offset, shards_info.size());
+ place_at_offset(dev_impl, shards_info_offset, shards_info);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster with metadata should be inside available cluster range");
+ });
+
+ // available_clusters.beg > available_clusters.end
+ write_record.write_to_disk(dev).get();
+ shards_info = {{3, {4, 2}}};
+ place_at_offset(dev_impl, shards_nb_offset, shards_info.size());
+ place_at_offset(dev_impl, shards_info_offset, shards_info);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Invalid cluster range");
+ });
+
+ // available_clusters.beg == available_clusters.end
+ write_record.write_to_disk(dev).get();
+ shards_info = {{2, {2, 2}}};
+ place_at_offset(dev_impl, shards_nb_offset, shards_info.size());
+ place_at_offset(dev_impl, shards_info_offset, shards_info);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Invalid cluster range");
+ });
+
+ // available_clusters contains cluster 0
+ write_record.write_to_disk(dev).get();
+ shards_info = {{1, {0, 5}}};
+ place_at_offset(dev_impl, shards_nb_offset, shards_info.size());
+ place_at_offset(dev_impl, shards_info_offset, shards_info);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Range of available clusters should not contain cluster 0");
+ });
+
+ // available_clusters overlap
+ write_record.write_to_disk(dev).get();
+ shards_info = {{1, {1, 3}}, {2, {2, 4}}};
+ place_at_offset(dev_impl, shards_nb_offset, shards_info.size());
+ place_at_offset(dev_impl, shards_info_offset, shards_info);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster ranges should not overlap");
+ });
+}
+
+SEASTAR_THREAD_TEST_CASE(invalid_alignment_read) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ const bootstrap_record write_record = default_write_record;
+
+ constexpr size_t alignment_offset = offsetof(bootstrap_record_disk, alignment);
+
+ // alignment not power of 2
+ write_record.write_to_disk(dev).get();
+ place_at_offset(dev_impl, alignment_offset, bootstrap_record::min_alignment + 1);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Alignment should be a power of 2");
+ });
+
+ // alignment smaller than bootstrap_record::min_alignment
+ write_record.write_to_disk(dev).get();
+ place_at_offset(dev_impl, alignment_offset, bootstrap_record::min_alignment / 2);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, fmt::format("Alignment should be greater or equal to {}", bootstrap_record::min_alignment));
+ });
+}
+
+SEASTAR_THREAD_TEST_CASE(invalid_cluster_size_read) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ const bootstrap_record write_record = default_write_record;
+
+ constexpr size_t cluster_size_offset = offsetof(bootstrap_record_disk, cluster_size);
+
+ // cluster_size not divisible by alignment
+ write_record.write_to_disk(dev).get();
+ place_at_offset(dev_impl, cluster_size_offset, write_record.alignment / 2);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster size should be divisible by alignment");
+ });
+
+ // cluster_size not power of 2
+ write_record.write_to_disk(dev).get();
+ place_at_offset(dev_impl, cluster_size_offset, write_record.alignment * 3);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster size should be a power of 2");
+ });
+}
+
+
+
+SEASTAR_THREAD_TEST_CASE(invalid_shards_info_write) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ bootstrap_record write_record = default_write_record;
+
+ // shards_nb > max_shards_nb
+ write_record = default_write_record;
+ write_record.shards_info = prepare_valid_shards_info(bootstrap_record::max_shards_nb + 1);
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, fmt::format("Shards number should be smaller or equal to {}", bootstrap_record::max_shards_nb));
+ });
+
+ // shards_nb == 0
+ write_record = default_write_record;
+ write_record.shards_info.clear();
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Shards number should be greater than 0");
+ });
+
+ // metadata_cluster not in available_clusters range
+ write_record = default_write_record;
+ write_record.shards_info = {{1, {2, 3}}};
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster with metadata should be inside available cluster range");
+ });
+
+ write_record = default_write_record;
+ write_record.shards_info = {{3, {2, 3}}};
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster with metadata should be inside available cluster range");
+ });
+
+ // available_clusters.beg > available_clusters.end
+ write_record = default_write_record;
+ write_record.shards_info = {{3, {4, 2}}};
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Invalid cluster range");
+ });
+
+ // available_clusters.beg == available_clusters.end
+ write_record = default_write_record;
+ write_record.shards_info = {{2, {2, 2}}};
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Invalid cluster range");
+ });
+
+ // available_clusters contains cluster 0
+ write_record = default_write_record;
+ write_record.shards_info = {{1, {0, 5}}};
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Range of available clusters should not contain cluster 0");
+ });
+
+ // available_clusters overlap
+ write_record = default_write_record;
+ write_record.shards_info = {{1, {1, 3}}, {2, {2, 4}}};
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster ranges should not overlap");
+ });
+}
+
+SEASTAR_THREAD_TEST_CASE(invalid_alignment_write) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ bootstrap_record write_record = default_write_record;
+
+ // alignment not power of 2
+ write_record = default_write_record;
+ write_record.alignment = bootstrap_record::min_alignment + 1;
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Alignment should be a power of 2");
+ });
+
+ // alignment smaller than bootstrap_record::min_alignment
+ write_record = default_write_record;
+ write_record.alignment = bootstrap_record::min_alignment / 2;
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, fmt::format("Alignment should be greater or equal to {}", bootstrap_record::min_alignment));
+ });
+}
+
+SEASTAR_THREAD_TEST_CASE(invalid_cluster_size_write) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ bootstrap_record write_record = default_write_record;
+
+ // cluster_size not divisible by alignment
+ write_record = default_write_record;
+ write_record.cluster_size = write_record.alignment / 2;
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster size should be divisible by alignment");
+ });
+
+ // cluster_size not power of 2
+ write_record = default_write_record;
+ write_record.cluster_size = write_record.alignment * 3;
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster size should be a power of 2");
+ });
+}
diff --git a/tests/unit/fs_mock_block_device.cc b/tests/unit/fs_mock_block_device.cc
new file mode 100644
index 00000000..6f83587e
--- /dev/null
+++ b/tests/unit/fs_mock_block_device.cc
@@ -0,0 +1,50 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB Ltd.
+ */
+
+#include "fs_mock_block_device.hh"
+
+namespace seastar::fs {
+
+namespace {
+logger mlogger("fs_mock_block_device");
+} // namespace
+
+future<size_t> mock_block_device_impl::write(uint64_t pos, const void* buffer, size_t len, const io_priority_class&) {
+ mlogger.debug("write({}, ..., {})", pos, len);
+ writes.emplace_back(write_operation {
+ pos,
+ temporary_buffer<uint8_t>(static_cast<const uint8_t*>(buffer), len)
+ });
+ if (buf.size() < pos + len)
+ buf.resize(pos + len);
+ std::memcpy(buf.data() + pos, buffer, len);
+ return make_ready_future<size_t>(len);
+}
+
+future<size_t> mock_block_device_impl::read(uint64_t pos, void* buffer, size_t len, const io_priority_class&) noexcept {
+ mlogger.debug("read({}, ..., {})", pos, len);
+ if (buf.size() < pos + len)
+ buf.resize(pos + len);
+ std::memcpy(buffer, buf.c_str() + pos, len);
+ return make_ready_future<size_t>(len);
+}
+
+} // seastar::fs
diff --git a/tests/unit/CMakeLists.txt b/tests/unit/CMakeLists.txt
index b2669e0a..f9591046 100644
--- a/tests/unit/CMakeLists.txt
+++ b/tests/unit/CMakeLists.txt
@@ -365,6 +365,10 @@ if (Seastar_EXPERIMENTAL_FS)
seastar_add_app_test (fs_block_device
SOURCES fs_block_device_test.cc
LIBRARIES seastar_testing)
+ seastar_add_test (fs_bootstrap_record
+ SOURCES
+ fs_bootstrap_record_test.cc
+ fs_mock_block_device.cc)
seastar_add_test (fs_cluster_allocator
KIND BOOST
SOURCES fs_cluster_allocator_test.cc)
--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:27 AM

to seastar-dev@googlegroups.com, Michał Niciejewski, sarna@scylladb.com, ankezy@gmail.com, wmitros@protonmail.com

From: Michał Niciejewski <qup...@gmail.com>

Bootstrap record serves the same role as the superblock in other
filesystems.
It contains basic information essential to properly bootstrap the
filesystem:
- filesystem version
- alignment used for data writes
- cluster size
- inode number of the root directory
- information needed to bootstrap every shard of the filesystem:
  * id of the first metadata log cluster
  * range of available clusters for data and metadata
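
The per-shard information is only usable if it satisfies a few invariants: each range is non-empty, no range contains cluster 0, the metadata cluster lies inside its shard's range, and ranges of different shards do not overlap. A minimal standalone sketch of such a check follows, assuming half-open ranges [beg, end) and a 64-bit cluster id. The names cluster_range_sketch, shard_info_sketch and validate_shards are hypothetical and the messages are illustrative only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

using cluster_id_sketch_t = uint64_t;  // assumption: the real id type may differ

// Half-open range of clusters: [beg, end).
struct cluster_range_sketch {
    cluster_id_sketch_t beg;
    cluster_id_sketch_t end;
};

struct shard_info_sketch {
    cluster_id_sketch_t metadata_cluster;     // first metadata log cluster
    cluster_range_sketch available_clusters;  // clusters owned by this shard
};

// Returns a description of the first violated invariant, or std::nullopt
// when the layout is valid. Takes the vector by value because it is sorted.
inline std::optional<std::string> validate_shards(std::vector<shard_info_sketch> shards) {
    if (shards.empty()) {
        return std::string("no shards");
    }
    // Per-shard check: 1 <= beg <= metadata_cluster < end.
    for (const shard_info_sketch& s : shards) {
        const cluster_range_sketch& r = s.available_clusters;
        if (r.beg >= r.end) {
            return std::string("invalid cluster range");
        }
        if (r.beg == 0) {
            return std::string("range contains cluster 0");
        }
        if (s.metadata_cluster < r.beg || s.metadata_cluster >= r.end) {
            return std::string("metadata cluster outside range");
        }
    }
    // Cross-shard check: after sorting by beg, neighbours must not overlap.
    std::sort(shards.begin(), shards.end(),
            [] (const shard_info_sketch& a, const shard_info_sketch& b) {
                return a.available_clusters.beg < b.available_clusters.beg;
            });
    for (size_t i = 1; i < shards.size(); ++i) {
        if (shards[i - 1].available_clusters.end > shards[i].available_clusters.beg) {
            return std::string("ranges overlap");
        }
    }
    return std::nullopt;
}
```

Sorting by the start of each range keeps the overlap check linear after the sort, since only neighbouring ranges need to be compared.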

Signed-off-by: Michał Niciejewski <qup...@gmail.com>
---
src/fs/bootstrap_record.hh | 98 ++++++++++++++++++
src/fs/crc.hh | 34 ++++++
src/fs/bootstrap_record.cc | 206 +++++++++++++++++++++++++++++++++++++
CMakeLists.txt | 3 +
4 files changed, 341 insertions(+)
create mode 100644 src/fs/bootstrap_record.hh
create mode 100644 src/fs/crc.hh
create mode 100644 src/fs/bootstrap_record.cc

diff --git a/src/fs/bootstrap_record.hh b/src/fs/bootstrap_record.hh
new file mode 100644
index 00000000..ee15295a
--- /dev/null
+++ b/src/fs/bootstrap_record.hh
@@ -0,0 +1,98 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/cluster.hh"
+#include "fs/inode.hh"
+#include "seastar/fs/block_device.hh"
+
+#include <stdexcept>
+
+namespace seastar::fs {
+
+class invalid_bootstrap_record : public std::runtime_error {
+public:
+ explicit invalid_bootstrap_record(const std::string& msg) : std::runtime_error(msg) {}
+ explicit invalid_bootstrap_record(const char* msg) : std::runtime_error(msg) {}
+};
+
+/// In-memory version of the record describing characteristics of the file system (~superblock).
+class bootstrap_record {
+public:
+ static constexpr uint64_t magic_number = 0x5343594c4c414653; // SCYLLAFS
+ static constexpr uint32_t max_shards_nb = 500;
+ static constexpr unit_size_t min_alignment = 4096;
+
+ struct shard_info {
+ cluster_id_t metadata_cluster; /// cluster id of the first metadata log cluster
+ cluster_range available_clusters; /// range of clusters for data for this shard
+ };
+
+ uint64_t version; /// file system version
+ unit_size_t alignment; /// write alignment in bytes
+ unit_size_t cluster_size; /// cluster size in bytes
+ inode_t root_directory; /// root dir inode number
+ std::vector<shard_info> shards_info; /// basic information about each file system shard
+
+ bootstrap_record() = default;
+ bootstrap_record(uint64_t version, unit_size_t alignment, unit_size_t cluster_size, inode_t root_directory,
+ std::vector<shard_info> shards_info)
+ : version(version), alignment(alignment), cluster_size(cluster_size) , root_directory(root_directory)
+ , shards_info(std::move(shards_info)) {}
+
+ /// number of file system shards
+ uint32_t shards_nb() const noexcept {
+ return shards_info.size();
+ }
+
+ static future<bootstrap_record> read_from_disk(block_device& device);
+ future<> write_to_disk(block_device& device) const;
+
+ friend bool operator==(const bootstrap_record&, const bootstrap_record&) noexcept;
+ friend bool operator!=(const bootstrap_record&, const bootstrap_record&) noexcept;
+};
+
+inline bool operator==(const bootstrap_record::shard_info& lhs, const bootstrap_record::shard_info& rhs) noexcept {
+ return lhs.metadata_cluster == rhs.metadata_cluster and lhs.available_clusters == rhs.available_clusters;
+}
+
+inline bool operator!=(const bootstrap_record::shard_info& lhs, const bootstrap_record::shard_info& rhs) noexcept {
+ return !(lhs == rhs);
+}
+
+inline bool operator!=(const bootstrap_record& lhs, const bootstrap_record& rhs) noexcept {
+ return !(lhs == rhs);
+}
+
+/// On-disk version of the record describing characteristics of the file system (~superblock).
+struct bootstrap_record_disk {
+ uint64_t magic;
+ uint64_t version;
+ unit_size_t alignment;
+ unit_size_t cluster_size;
+ inode_t root_directory;
+ uint32_t shards_nb;
+ bootstrap_record::shard_info shards_info[bootstrap_record::max_shards_nb];
+ uint32_t crc;
+};
+
+}
diff --git a/src/fs/crc.hh b/src/fs/crc.hh
new file mode 100644
index 00000000..da557323
--- /dev/null
+++ b/src/fs/crc.hh
@@ -0,0 +1,34 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include <boost/crc.hpp>
+
+namespace seastar::fs {
+
+inline uint32_t crc32(const void* buff, size_t len) noexcept {
+ boost::crc_32_type result;
+ result.process_bytes(buff, len);
+ return result.checksum();
+}
+
+}
diff --git a/src/fs/bootstrap_record.cc b/src/fs/bootstrap_record.cc
new file mode 100644
index 00000000..a342efb6
--- /dev/null
+++ b/src/fs/bootstrap_record.cc
@@ -0,0 +1,206 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#include "fs/bootstrap_record.hh"
+#include "fs/crc.hh"
+#include "seastar/core/print.hh"
+#include "seastar/core/units.hh"
+
+namespace seastar::fs {
+
+namespace {
+
+constexpr unit_size_t write_alignment = 4 * KB;
+constexpr disk_offset_t bootstrap_record_offset = 0;
+
+constexpr size_t aligned_bootstrap_record_size =
+ (1 + (sizeof(bootstrap_record_disk) - 1) / write_alignment) * write_alignment;
+constexpr size_t crc_offset = offsetof(bootstrap_record_disk, crc);
+
+inline std::optional<invalid_bootstrap_record> check_alignment(unit_size_t alignment) {
+ if (!is_power_of_2(alignment)) {
+ return invalid_bootstrap_record(fmt::format("Alignment should be a power of 2, read alignment '{}'",
+ alignment));
+ }
+ if (alignment < bootstrap_record::min_alignment) {
+ return invalid_bootstrap_record(fmt::format("Alignment should be greater or equal to {}, read alignment '{}'",
+ bootstrap_record::min_alignment, alignment));
+ }
+ return std::nullopt;
+}
+
+inline std::optional<invalid_bootstrap_record> check_cluster_size(unit_size_t cluster_size, unit_size_t alignment) {
+ if (!is_power_of_2(cluster_size)) {
+ return invalid_bootstrap_record(fmt::format("Cluster size should be a power of 2, read cluster size '{}'", cluster_size));
+ }
+ if (cluster_size % alignment != 0) {
+ return invalid_bootstrap_record(fmt::format(
+ "Cluster size should be divisible by alignment, read alignment '{}', read cluster size '{}'",
+ alignment, cluster_size));
+ }
+ return std::nullopt;
+}
+
+inline std::optional<invalid_bootstrap_record> check_shards_number(uint32_t shards_nb) {
+ if (shards_nb == 0) {
+ return invalid_bootstrap_record(fmt::format("Shards number should be greater than 0, read shards number '{}'",
+ shards_nb));
+ }
+ if (shards_nb > bootstrap_record::max_shards_nb) {
+ return invalid_bootstrap_record(fmt::format(
+ "Shards number should be smaller or equal to {}, read shards number '{}'",
+ bootstrap_record::max_shards_nb, shards_nb));
+ }

+ return std::nullopt;
+}
+

+std::optional<invalid_bootstrap_record> check_shards_info(std::vector<bootstrap_record::shard_info> shards_info) {
+ // check 1 <= beg <= metadata_cluster < end
+ for (const bootstrap_record::shard_info& info : shards_info) {
+ if (info.available_clusters.beg >= info.available_clusters.end) {
+ return invalid_bootstrap_record(fmt::format("Invalid cluster range, read cluster range [{}, {})",
+ info.available_clusters.beg, info.available_clusters.end));
+ }
+ if (info.available_clusters.beg == 0) {
+ return invalid_bootstrap_record(fmt::format(
+ "Range of available clusters should not contain cluster 0, read cluster range [{}, {})",
+ info.available_clusters.beg, info.available_clusters.end));
+ }
+ if (info.available_clusters.beg > info.metadata_cluster ||
+ info.available_clusters.end <= info.metadata_cluster) {
+ return invalid_bootstrap_record(fmt::format(
+ "Cluster with metadata should be inside available cluster range, read cluster range [{}, {}), read metadata cluster '{}'",
+ info.available_clusters.beg, info.available_clusters.end, info.metadata_cluster));
+ }
+ }
+
+ // check that ranges don't overlap
+ sort(shards_info.begin(), shards_info.end(),
+ [] (const bootstrap_record::shard_info& left,
+ const bootstrap_record::shard_info& right) {
+ return left.available_clusters.beg < right.available_clusters.beg;
+ });
+ for (size_t i = 1; i < shards_info.size(); i++) {
+ if (shards_info[i - 1].available_clusters.end > shards_info[i].available_clusters.beg) {
+ return invalid_bootstrap_record(fmt::format(
+ "Cluster ranges should not overlap, overlapping ranges [{}, {}), [{}, {})",
+ shards_info[i - 1].available_clusters.beg, shards_info[i - 1].available_clusters.end,
+ shards_info[i].available_clusters.beg, shards_info[i].available_clusters.end));
+ }
+ }

+ return std::nullopt;
+}
+
+}

+
+future<bootstrap_record> bootstrap_record::read_from_disk(block_device& device) {
+ auto bootstrap_record_buff = temporary_buffer<char>::aligned(write_alignment, aligned_bootstrap_record_size);
+ return device.read(bootstrap_record_offset, bootstrap_record_buff.get_write(), aligned_bootstrap_record_size)
+ .then([bootstrap_record_buff = std::move(bootstrap_record_buff)] (size_t ret) {
+ if (ret != aligned_bootstrap_record_size) {
+ return make_exception_future<bootstrap_record>(
+ invalid_bootstrap_record(fmt::format(
+ "Error while reading bootstrap record block, {} bytes read instead of {}",
+ ret, aligned_bootstrap_record_size)));
+ }
+
+ bootstrap_record_disk bootstrap_record_disk;
+ std::memcpy(&bootstrap_record_disk, bootstrap_record_buff.get(), sizeof(bootstrap_record_disk));
+
+ const uint32_t crc_calc = crc32(bootstrap_record_buff.get(), crc_offset);
+ if (crc_calc != bootstrap_record_disk.crc) {
+ return make_exception_future<bootstrap_record>(
+ invalid_bootstrap_record(fmt::format("Invalid CRC, expected crc '{}', read crc '{}'",
+ crc_calc, bootstrap_record_disk.crc)));
+ }
+ if (magic_number != bootstrap_record_disk.magic) {
+ return make_exception_future<bootstrap_record>(
+ invalid_bootstrap_record(fmt::format("Invalid magic number, expected magic '{}', read magic '{}'",
+ magic_number, bootstrap_record_disk.magic)));
+ }
+ if (std::optional<invalid_bootstrap_record> ret_check;
+ (ret_check = check_alignment(bootstrap_record_disk.alignment)) ||
+ (ret_check = check_cluster_size(bootstrap_record_disk.cluster_size, bootstrap_record_disk.alignment)) ||
+ (ret_check = check_shards_number(bootstrap_record_disk.shards_nb))) {
+ return make_exception_future<bootstrap_record>(ret_check.value());
+ }
+
+ std::vector<shard_info> tmp_shards_info(bootstrap_record_disk.shards_info,
+ bootstrap_record_disk.shards_info + bootstrap_record_disk.shards_nb);
+
+ if (std::optional<invalid_bootstrap_record> ret_check;
+ (ret_check = check_shards_info(tmp_shards_info))) {
+ return make_exception_future<bootstrap_record>(ret_check.value());
+ }
+
+ bootstrap_record bootstrap_record_mem(bootstrap_record_disk.version,
+ bootstrap_record_disk.alignment,
+ bootstrap_record_disk.cluster_size,
+ bootstrap_record_disk.root_directory,
+ std::move(tmp_shards_info));
+
+ return make_ready_future<bootstrap_record>(std::move(bootstrap_record_mem));
+ });
+}
+
+future<> bootstrap_record::write_to_disk(block_device& device) const {
+ // initial checks
+ if (std::optional<invalid_bootstrap_record> ret_check;
+ (ret_check = check_alignment(alignment)) ||
+ (ret_check = check_cluster_size(cluster_size, alignment)) ||
+ (ret_check = check_shards_number(shards_nb())) ||
+ (ret_check = check_shards_info(shards_info))) {
+ return make_exception_future<>(ret_check.value());
+ }
+
+ auto bootstrap_record_buff = temporary_buffer<char>::aligned(write_alignment, aligned_bootstrap_record_size);
+ std::memset(bootstrap_record_buff.get_write(), 0, aligned_bootstrap_record_size);
+ bootstrap_record_disk* bootstrap_record_disk = (struct bootstrap_record_disk*)bootstrap_record_buff.get_write();
+
+ // prepare bootstrap_record_disk records
+ bootstrap_record_disk->magic = bootstrap_record::magic_number;
+ bootstrap_record_disk->version = version;
+ bootstrap_record_disk->alignment = alignment;
+ bootstrap_record_disk->cluster_size = cluster_size;
+ bootstrap_record_disk->root_directory = root_directory;
+ bootstrap_record_disk->shards_nb = shards_nb();
+ std::copy(shards_info.begin(), shards_info.end(), bootstrap_record_disk->shards_info);
+ bootstrap_record_disk->crc = crc32(bootstrap_record_disk, crc_offset);
+
+ return device.write(bootstrap_record_offset, bootstrap_record_buff.get(), aligned_bootstrap_record_size)
+ .then([bootstrap_record_buff = std::move(bootstrap_record_buff)] (size_t ret) {
+ if (ret != aligned_bootstrap_record_size) {
+ return make_exception_future<>(
+ invalid_bootstrap_record(fmt::format(
+ "Error while writing bootstrap record block to disk, {} bytes written instead of {}",
+ ret, aligned_bootstrap_record_size)));
+ }
+ return make_ready_future<>();
+ });
+}
+
+bool operator==(const bootstrap_record& lhs, const bootstrap_record& rhs) noexcept {
+ return lhs.version == rhs.version and lhs.alignment == rhs.alignment and
+ lhs.cluster_size == rhs.cluster_size and lhs.root_directory == rhs.root_directory and
+ lhs.shards_info == rhs.shards_info;

+}
+
+}
diff --git a/CMakeLists.txt b/CMakeLists.txt

index 891201a3..ca994d42 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -661,9 +661,12 @@ if (Seastar_EXPERIMENTAL_FS)
include/seastar/fs/file.hh
include/seastar/fs/temporary_file.hh
src/fs/bitwise.hh
+ src/fs/bootstrap_record.cc
+ src/fs/bootstrap_record.hh
src/fs/cluster.hh
src/fs/cluster_allocator.cc
src/fs/cluster_allocator.hh
+ src/fs/crc.hh

Krzysztof Małysa

<varqox@gmail.com>


Apr 20, 2020, 8:02:28 AM4/20/20

to seastar-dev@googlegroups.com, Krzysztof Małysa, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

overloaded is a useful wrapper that simplifies usage of std::visit over
std::variant. It allows matching variants by type using lambdas, in
a way similar to pattern matching in functional languages.
For details see: https://en.cppreference.com/w/cpp/utility/variant/visit#Example

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
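A minimal usage sketch of the wrapper described above. The variant type
`value` and the function `describe()` are hypothetical names introduced
only for this illustration; the two `overloaded` lines are repeated so the
snippet compiles on its own (C++17).

```cpp
#include <string>
#include <variant>

// Same two lines as in overloaded.hh, repeated to keep the sketch self-contained.
template<class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
template<class... Ts> overloaded(Ts...) -> overloaded<Ts...>;

// Hypothetical variant type, used only for this illustration.
using value = std::variant<int, double, std::string>;

// Returns a short description of whichever alternative is currently held.
// Every lambda returns std::string, as std::visit requires one result type.
inline std::string describe(const value& v) {
    return std::visit(overloaded {
        [](int i) { return "int: " + std::to_string(i); },
        [](double) { return std::string("double"); },
        [](const std::string& s) { return "string: " + s; },
    }, v);
}
```

Each lambda handles one alternative; if an alternative is left unhandled,
the call to std::visit fails to compile, which is the main benefit over a
chain of std::get_if checks.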

include/seastar/fs/overloaded.hh | 26 ++++++++++++++++++++++++++
CMakeLists.txt | 1 +
2 files changed, 27 insertions(+)
create mode 100644 include/seastar/fs/overloaded.hh

diff --git a/include/seastar/fs/overloaded.hh b/include/seastar/fs/overloaded.hh
new file mode 100644
index 00000000..2a205ba3
--- /dev/null
+++ b/include/seastar/fs/overloaded.hh
@@ -0,0 +1,26 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+

+// Taken from: https://en.cppreference.com/w/cpp/utility/variant/visit
+template<class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
+template<class... Ts> overloaded(Ts...) -> overloaded<Ts...>;
diff --git a/CMakeLists.txt b/CMakeLists.txt
index ca994d42..be3f921f 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt

@@ -659,6 +659,7 @@ if (Seastar_EXPERIMENTAL_FS)

# SeastarFS source files
include/seastar/fs/block_device.hh

include/seastar/fs/file.hh
+ include/seastar/fs/overloaded.hh
include/seastar/fs/temporary_file.hh
src/fs/bitwise.hh
src/fs/bootstrap_record.cc
--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>


Apr 20, 2020, 8:02:29 AM4/20/20

to seastar-dev@googlegroups.com, Krzysztof Małysa, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

path.hh provides the extract_last_component() function, which extracts
the last component of the provided path and removes it from that path.

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
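A small sketch of how the function might be used to walk a path from the
end. `split_components()` is a hypothetical helper, not part of the patch;
extract_last_component() is copied from path.hh so the snippet stands on
its own.

```cpp
#include <string>
#include <utility>
#include <vector>

// Copy of extract_last_component() from src/fs/path.hh.
inline std::string extract_last_component(std::string& path) {
    auto beg = path.find_last_of('/');
    if (beg == path.npos) {
        std::string res = std::move(path);
        path = {};
        return res;
    }
    auto res = path.substr(beg + 1);
    path.resize(beg + 1);
    return res;
}

// Hypothetical helper: returns all non-empty components of @p path, in order.
inline std::vector<std::string> split_components(std::string path) {
    std::vector<std::string> reversed;
    while (!path.empty()) {
        std::string last = extract_last_component(path);
        if (!last.empty()) {
            reversed.push_back(std::move(last));
        }
        // path is now empty or ends with '/'; drop that one slash so the
        // next call makes progress instead of returning "" forever.
        if (!path.empty()) {
            path.pop_back();
        }
    }
    return std::vector<std::string>(reversed.rbegin(), reversed.rend());
}
```

Note that the trailing '/' is kept in the path after extraction, so a
caller that wants to continue towards the root has to strip it itself, as
the helper above does.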

src/fs/path.hh | 42 ++++++++++++++++++
tests/unit/fs_path_test.cc | 90 ++++++++++++++++++++++++++++++++++++++
CMakeLists.txt | 1 +
tests/unit/CMakeLists.txt | 3 ++
4 files changed, 136 insertions(+)
create mode 100644 src/fs/path.hh
create mode 100644 tests/unit/fs_path_test.cc

diff --git a/src/fs/path.hh b/src/fs/path.hh
new file mode 100644
index 00000000..9da4c517
--- /dev/null
+++ b/src/fs/path.hh
@@ -0,0 +1,42 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+

+#include <string>

+
+namespace seastar::fs {
+

+// Extracts the last component in @p path. WARNING: The last component is empty iff @p path is empty or ends with '/'
+inline std::string extract_last_component(std::string& path) {
+ auto beg = path.find_last_of('/');
+ if (beg == path.npos) {
+ std::string res = std::move(path);
+ path = {};
+ return res;
+ }
+
+ auto res = path.substr(beg + 1);
+ path.resize(beg + 1);
+ return res;

+}
+
+} // namespace seastar::fs

diff --git a/tests/unit/fs_path_test.cc b/tests/unit/fs_path_test.cc
new file mode 100644
index 00000000..956e64d7
--- /dev/null
+++ b/tests/unit/fs_path_test.cc
@@ -0,0 +1,90 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*

+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#include "fs/path.hh"
+
+#define BOOST_TEST_MODULE fs
+#include <boost/test/included/unit_test.hpp>
+

+using namespace seastar::fs;
+

+BOOST_AUTO_TEST_CASE(last_component_simple) {
+ {
+ std::string str = "";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), "");
+ BOOST_REQUIRE_EQUAL(str, "");
+ }
+ {
+ std::string str = "/";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), "");
+ BOOST_REQUIRE_EQUAL(str, "/");
+ }
+ {
+ std::string str = "/foo/bar.txt";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), "bar.txt");
+ BOOST_REQUIRE_EQUAL(str, "/foo/");
+ }
+ {
+ std::string str = "/foo/.bar";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), ".bar");
+ BOOST_REQUIRE_EQUAL(str, "/foo/");
+ }
+ {
+ std::string str = "/foo/bar/";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), "");
+ BOOST_REQUIRE_EQUAL(str, "/foo/bar/");
+ }
+ {
+ std::string str = "/foo/.";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), ".");
+ BOOST_REQUIRE_EQUAL(str, "/foo/");
+ }
+ {
+ std::string str = "/foo/..";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), "..");
+ BOOST_REQUIRE_EQUAL(str, "/foo/");
+ }
+ {
+ std::string str = "bar.txt";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), "bar.txt");
+ BOOST_REQUIRE_EQUAL(str, "");
+ }
+ {
+ std::string str = ".bar";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), ".bar");
+ BOOST_REQUIRE_EQUAL(str, "");
+ }
+ {
+ std::string str = ".";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), ".");
+ BOOST_REQUIRE_EQUAL(str, "");
+ }
+ {
+ std::string str = "..";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), "..");
+ BOOST_REQUIRE_EQUAL(str, "");
+ }
+ {
+ std::string str = "//host";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), "host");
+ BOOST_REQUIRE_EQUAL(str, "//");
+ }
+}
diff --git a/CMakeLists.txt b/CMakeLists.txt
index be3f921f..fb8fe32c 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -670,6 +670,7 @@ if (Seastar_EXPERIMENTAL_FS)
src/fs/crc.hh
src/fs/file.cc
src/fs/inode.hh
+ src/fs/path.hh
src/fs/range.hh
src/fs/units.hh
)
diff --git a/tests/unit/CMakeLists.txt b/tests/unit/CMakeLists.txt
index f9591046..07551b0b 100644
--- a/tests/unit/CMakeLists.txt
+++ b/tests/unit/CMakeLists.txt
@@ -372,6 +372,9 @@ if (Seastar_EXPERIMENTAL_FS)

seastar_add_test (fs_cluster_allocator
KIND BOOST
SOURCES fs_cluster_allocator_test.cc)

+ seastar_add_test (fs_path
+ KIND BOOST
+ SOURCES fs_path_test.cc)

Krzysztof Małysa

<varqox@gmail.com>


Apr 20, 2020, 8:02:30 AM4/20/20

to seastar-dev@googlegroups.com, Krzysztof Małysa, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

value_shared_lock allows locking (using shared_mutex) a specified value.
One operation locks only one value, but value_shared_lock lets you
maintain locks on different values in one place. Locking is also
"on demand", i.e. the corresponding shared_mutex is not created until a
lock is taken on the value, and it is deleted as soon as the value is
no longer locked by anyone. It serves as a dynamic pool of shared_mutexes
acquired on demand.

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
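The same "pool of mutexes created on demand" idea, sketched with the
standard library only so that it can be compiled without Seastar. This is
an analogue, not the code from this patch: it blocks threads instead of
returning futures, and it needs an extra mutex around the map because,
unlike a Seastar shard, it may be called from several threads. The class
name `blocking_value_shared_lock` and the `tracked_values()` accessor are
hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <utility>

template<class Value>
class blocking_value_shared_lock {
    struct lock_info {
        std::size_t users_num = 0;
        std::shared_mutex lock;
    };

    std::mutex _map_guard; // protects _locks and every users_num
    std::map<Value, lock_info> _locks;

    // Creates the entry on first use and counts the new user.
    lock_info& acquire_entry(const Value& val) {
        std::lock_guard<std::mutex> guard(_map_guard);
        lock_info& info = _locks.try_emplace(val).first->second;
        ++info.users_num;
        return info;
    }

    // Drops one user; the entry is erased when nobody uses it any more.
    void release_entry(const Value& val) {
        std::lock_guard<std::mutex> guard(_map_guard);
        auto it = _locks.find(val);
        if (it != _locks.end() && --it->second.users_num == 0) {
            _locks.erase(it);
        }
    }

    // Calls release_entry() when the enclosing scope ends.
    struct entry_releaser {
        blocking_value_shared_lock& self;
        const Value& val;
        ~entry_releaser() { self.release_entry(val); }
    };

public:
    template<class Func>
    auto with_shared_on(const Value& val, Func&& func) {
        lock_info& info = acquire_entry(val);
        entry_releaser releaser{*this, val};                 // destroyed last
        std::shared_lock<std::shared_mutex> lock(info.lock); // unlocked first
        return std::forward<Func>(func)();
    }

    template<class Func>
    auto with_lock_on(const Value& val, Func&& func) {
        lock_info& info = acquire_entry(val);
        entry_releaser releaser{*this, val};
        std::unique_lock<std::shared_mutex> lock(info.lock);
        return std::forward<Func>(func)();
    }

    // Number of values that currently have a mutex allocated.
    std::size_t tracked_values() {
        std::lock_guard<std::mutex> guard(_map_guard);
        return _locks.size();
    }
};
```

The user count is incremented before waiting for the mutex, as in the
patch, so an entry is never erased while someone is still queued on it.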

src/fs/value_shared_lock.hh | 65 +++++++++++++++++++++++++++++++++++++
CMakeLists.txt | 1 +
2 files changed, 66 insertions(+)
create mode 100644 src/fs/value_shared_lock.hh

diff --git a/src/fs/value_shared_lock.hh b/src/fs/value_shared_lock.hh
new file mode 100644
index 00000000..6c7a3adf
--- /dev/null
+++ b/src/fs/value_shared_lock.hh
@@ -0,0 +1,65 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+

+#pragma once
+
+#include "seastar/core/shared_mutex.hh"
+
+#include <map>

+
+namespace seastar::fs {
+

+template<class Value>
+class value_shared_lock {
+ struct lock_info {
+ size_t users_num = 0;
+ shared_mutex lock;
+ };
+
+ std::map<Value, lock_info> _locks;
+
+public:
+ value_shared_lock() = default;
+
+ template<class Func>
+ auto with_shared_on(const Value& val, Func&& func) {
+ auto it = _locks.emplace(val, lock_info {}).first;
+ ++it->second.users_num;
+ return with_shared(it->second.lock, std::forward<Func>(func)).finally([this, it] {
+ if (--it->second.users_num == 0) {
+ _locks.erase(it);
+ }
+ });
+ }
+
+ template<class Func>
+ auto with_lock_on(const Value& val, Func&& func) {
+ auto it = _locks.emplace(val, lock_info {}).first;
+ ++it->second.users_num;
+ return with_lock(it->second.lock, std::forward<Func>(func)).finally([this, it] {
+ if (--it->second.users_num == 0) {
+ _locks.erase(it);
+ }
+ });
+ }

+};
+
+} // namespace seastar::fs

diff --git a/CMakeLists.txt b/CMakeLists.txt
index fb8fe32c..8a59eca6 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -673,6 +673,7 @@ if (Seastar_EXPERIMENTAL_FS)
src/fs/path.hh
src/fs/range.hh
src/fs/units.hh
+ src/fs/value_shared_lock.hh
)
endif()

--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>


Apr 20, 2020, 8:02:32 AM4/20/20

to seastar-dev@googlegroups.com, Krzysztof Małysa, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

Creating an unlinked file may be useful as a temporary file, or to expose
the file via a path only after it has been filled with contents.

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---

src/fs/metadata_disk_entries.hh | 51 +++++++++++-
src/fs/metadata_log.hh | 6 ++
src/fs/metadata_log_bootstrap.hh | 2 +
.../create_and_open_unlinked_file.hh | 77 +++++++++++++++++++
src/fs/metadata_to_disk_buffer.hh | 5 ++
src/fs/metadata_log.cc | 21 +++++
src/fs/metadata_log_bootstrap.cc | 13 ++++
CMakeLists.txt | 1 +
8 files changed, 175 insertions(+), 1 deletion(-)
create mode 100644 src/fs/metadata_log_operations/create_and_open_unlinked_file.hh

diff --git a/src/fs/metadata_disk_entries.hh b/src/fs/metadata_disk_entries.hh
index 44c2a1c7..437c2c2b 100644
--- a/src/fs/metadata_disk_entries.hh
+++ b/src/fs/metadata_disk_entries.hh
@@ -27,10 +27,52 @@

namespace seastar::fs {

+struct ondisk_unix_metadata {
+ uint32_t perms;
+ uint32_t uid;
+ uint32_t gid;
+ uint64_t btime_ns;
+ uint64_t mtime_ns;
+ uint64_t ctime_ns;
+} __attribute__((packed));
+
+static_assert(sizeof(decltype(ondisk_unix_metadata::perms)) >= sizeof(decltype(unix_metadata::perms)));
+static_assert(sizeof(decltype(ondisk_unix_metadata::uid)) >= sizeof(decltype(unix_metadata::uid)));
+static_assert(sizeof(decltype(ondisk_unix_metadata::gid)) >= sizeof(decltype(unix_metadata::gid)));
+static_assert(sizeof(decltype(ondisk_unix_metadata::btime_ns)) >= sizeof(decltype(unix_metadata::btime_ns)));
+static_assert(sizeof(decltype(ondisk_unix_metadata::mtime_ns)) >= sizeof(decltype(unix_metadata::mtime_ns)));
+static_assert(sizeof(decltype(ondisk_unix_metadata::ctime_ns)) >= sizeof(decltype(unix_metadata::ctime_ns)));
+
+inline unix_metadata ondisk_metadata_to_metadata(const ondisk_unix_metadata& ondisk_metadata) noexcept {
+ unix_metadata res;
+ static_assert(sizeof(ondisk_metadata) == 36,
+ "metadata size changed: check if above static asserts and below assignments need update");
+ res.perms = static_cast<file_permissions>(ondisk_metadata.perms);
+ res.uid = ondisk_metadata.uid;
+ res.gid = ondisk_metadata.gid;
+ res.btime_ns = ondisk_metadata.btime_ns;
+ res.mtime_ns = ondisk_metadata.mtime_ns;
+ res.ctime_ns = ondisk_metadata.ctime_ns;

+ return res;
+}
+

+inline ondisk_unix_metadata metadata_to_ondisk_metadata(const unix_metadata& metadata) noexcept {
+ ondisk_unix_metadata res;
+ static_assert(sizeof(res) == 36, "metadata size changed: check if below assignments need update");
+ res.perms = static_cast<decltype(res.perms)>(metadata.perms);
+ res.uid = metadata.uid;
+ res.gid = metadata.gid;
+ res.btime_ns = metadata.btime_ns;
+ res.mtime_ns = metadata.mtime_ns;
+ res.ctime_ns = metadata.ctime_ns;

+ return res;
+}
+

enum ondisk_type : uint8_t {
INVALID = 0,
CHECKPOINT,
NEXT_METADATA_CLUSTER,
+ CREATE_INODE,
};

struct ondisk_checkpoint {
@@ -54,9 +96,16 @@ struct ondisk_next_metadata_cluster {
cluster_id_t cluster_id; // metadata log continues there
} __attribute__((packed));

+struct ondisk_create_inode {
+ inode_t inode;
+ uint8_t is_directory;
+ ondisk_unix_metadata metadata;
+} __attribute__((packed));
+
template<typename T>
constexpr size_t ondisk_entry_size(const T& entry) noexcept {
- static_assert(std::is_same_v<T, ondisk_next_metadata_cluster>, "ondisk entry size not defined for given type");
+ static_assert(std::is_same_v<T, ondisk_next_metadata_cluster> or
+ std::is_same_v<T, ondisk_create_inode>, "ondisk entry size not defined for given type");
return sizeof(ondisk_type) + sizeof(entry);
}

diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
index c10852a3..6f069c13 100644
--- a/src/fs/metadata_log.hh
+++ b/src/fs/metadata_log.hh
@@ -156,6 +156,8 @@ class metadata_log {

friend class metadata_log_bootstrap;

+ friend class create_and_open_unlinked_file_operation;
+
public:
metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment,
shared_ptr<metadata_to_disk_buffer> cluster_buff);
@@ -176,6 +178,8 @@ class metadata_log {
return _inodes.count(inode) != 0;
}

+ inode_info& memory_only_create_inode(inode_t inode, bool is_directory, unix_metadata metadata);
+
template<class Func>
void schedule_background_task(Func&& task) {
_background_futures = when_all_succeed(_background_futures.get_future(), std::forward<Func>(task));
@@ -286,6 +290,8 @@ class metadata_log {
// Returns size of the file or throws exception iff @p inode is invalid
file_offset_t file_size(inode_t inode) const;

+ future<inode_t> create_and_open_unlinked_file(file_permissions perms);
+
// All disk-related errors will be exposed here
future<> flush_log() {
return flush_curr_cluster();
diff --git a/src/fs/metadata_log_bootstrap.hh b/src/fs/metadata_log_bootstrap.hh
index 5da79631..4a1fa7e9 100644
--- a/src/fs/metadata_log_bootstrap.hh
+++ b/src/fs/metadata_log_bootstrap.hh
@@ -115,6 +115,8 @@ class metadata_log_bootstrap {

bool inode_exists(inode_t inode);

+ future<> bootstrap_create_inode();
+
public:
static future<> bootstrap(metadata_log& metadata_log, inode_t root_dir, cluster_id_t first_metadata_cluster_id,
cluster_range available_clusters, fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id);
diff --git a/src/fs/metadata_log_operations/create_and_open_unlinked_file.hh b/src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
new file mode 100644
index 00000000..79c5e9f2
--- /dev/null
+++ b/src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
@@ -0,0 +1,77 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+

+#include "fs/metadata_disk_entries.hh"
+#include "fs/metadata_log.hh"
+#include "seastar/core/future.hh"

+
+namespace seastar::fs {
+

+class create_and_open_unlinked_file_operation {
+ metadata_log& _metadata_log;
+
+ create_and_open_unlinked_file_operation(metadata_log& metadata_log) : _metadata_log(metadata_log) {}
+
+ future<inode_t> create_and_open_unlinked_file(file_permissions perms) {
+ using namespace std::chrono;
+ uint64_t curr_time_ns = duration_cast<nanoseconds>(system_clock::now().time_since_epoch()).count();
+ unix_metadata unx_mtdt = {
+ perms,
+ 0, // TODO: Eventually, we'll want a user to be able to pass his credentials when bootstrapping the
+ 0, // file system -- that will allow us to authorize users on startup (e.g. via LDAP or whatnot).
+ curr_time_ns,
+ curr_time_ns,
+ curr_time_ns
+ };
+
+ inode_t new_inode = _metadata_log._inode_allocator.alloc();
+ ondisk_create_inode ondisk_entry {
+ new_inode,
+ false,
+ metadata_to_ondisk_metadata(unx_mtdt)
+ };
+
+ switch (_metadata_log.append_ondisk_entry(ondisk_entry)) {
+ case metadata_log::append_result::TOO_BIG:
+ return make_exception_future<inode_t>(cluster_size_too_small_to_perform_operation_exception());
+ case metadata_log::append_result::NO_SPACE:
+ return make_exception_future<inode_t>(no_more_space_exception());
+ case metadata_log::append_result::APPENDED:
+ inode_info& new_inode_info = _metadata_log.memory_only_create_inode(new_inode, false, unx_mtdt);
+ // We don't have to lock, as there was no context switch since the allocation of the inode number
+ ++new_inode_info.opened_files_count;
+ return make_ready_future<inode_t>(new_inode);
+ }
+ __builtin_unreachable();
+ }
+
+public:
+ static future<inode_t> perform(metadata_log& metadata_log, file_permissions perms) {
+ return do_with(create_and_open_unlinked_file_operation(metadata_log),
+ [perms = std::move(perms)](auto& obj) {
+ return obj.create_and_open_unlinked_file(std::move(perms));

+ });
+ }
+};
+
+} // namespace seastar::fs

diff --git a/src/fs/metadata_to_disk_buffer.hh b/src/fs/metadata_to_disk_buffer.hh
index bd60f4f3..593ad46a 100644
--- a/src/fs/metadata_to_disk_buffer.hh
+++ b/src/fs/metadata_to_disk_buffer.hh
@@ -152,6 +152,11 @@ class metadata_to_disk_buffer : protected to_disk_buffer {
}

public:
+ [[nodiscard]] virtual append_result append(const ondisk_create_inode& create_inode) noexcept {
+ // TODO: maybe add a constexpr static field to each ondisk_* entry specifying what type it is?
+ return append_simple(CREATE_INODE, create_inode);
+ }
+
using to_disk_buffer::flush_to_disk;
};

diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc
index 6e29f2e5..be523fc7 100644
--- a/src/fs/metadata_log.cc
+++ b/src/fs/metadata_log.cc
@@ -26,6 +26,7 @@
#include "fs/metadata_disk_entries.hh"
#include "fs/metadata_log.hh"
#include "fs/metadata_log_bootstrap.hh"
+#include "fs/metadata_log_operations/create_and_open_unlinked_file.hh"
#include "fs/metadata_to_disk_buffer.hh"
#include "fs/path.hh"
#include "fs/units.hh"
@@ -80,6 +81,22 @@ future<> metadata_log::shutdown() {
});
}

+inode_info& metadata_log::memory_only_create_inode(inode_t inode, bool is_directory, unix_metadata metadata) {
+ assert(_inodes.count(inode) == 0);
+ return _inodes.emplace(inode, inode_info {
+ 0,
+ 0,
+ metadata,
+ [&]() -> decltype(inode_info::contents) {
+ if (is_directory) {
+ return inode_info::directory {};
+ }
+
+ return inode_info::file {};
+ }()
+ }).first->second;
+}
+
void metadata_log::schedule_flush_of_curr_cluster() {
// Make writes concurrent (TODO: maybe serialized within *one* cluster would be faster?)
schedule_background_task(do_with(_curr_cluster_buff, &_device, [](auto& crr_clstr_bf, auto& device) {
@@ -213,6 +230,10 @@ file_offset_t metadata_log::file_size(inode_t inode) const {
}, it->second.contents);
}

+future<inode_t> metadata_log::create_and_open_unlinked_file(file_permissions perms) {
+ return create_and_open_unlinked_file_operation::perform(*this, std::move(perms));
+}
+
// TODO: think about how to make filesystem recoverable from ENOSPACE situation: flush() (or something else) throws ENOSPACE,
// then it should be possible to compact some data (e.g. by truncating a file) via top-level interface and retrying the flush()
// without a ENOSPACE error. In particular if we delete all files after ENOSPACE it should be successful. It becomes especially
diff --git a/src/fs/metadata_log_bootstrap.cc b/src/fs/metadata_log_bootstrap.cc
index 926d79fe..702e0e34 100644
--- a/src/fs/metadata_log_bootstrap.cc
+++ b/src/fs/metadata_log_bootstrap.cc
@@ -211,6 +211,8 @@ future<> metadata_log_bootstrap::bootstrap_checkpointed_data() {
return invalid_entry_exception();
case NEXT_METADATA_CLUSTER:
return bootstrap_next_metadata_cluster();
+ case CREATE_INODE:
+ return bootstrap_create_inode();
}

// unknown type => metadata log corruption
@@ -242,6 +244,17 @@ bool metadata_log_bootstrap::inode_exists(inode_t inode) {
return _metadata_log._inodes.count(inode) != 0;
}

+future<> metadata_log_bootstrap::bootstrap_create_inode() {
+ ondisk_create_inode entry;
+ if (not _curr_checkpoint.read_entry(entry) or inode_exists(entry.inode)) {
+ return invalid_entry_exception();
+ }
+
+ _metadata_log.memory_only_create_inode(entry.inode, entry.is_directory,
+ ondisk_metadata_to_metadata(entry.metadata));

+ return now();
+}
+

future<> metadata_log_bootstrap::bootstrap(metadata_log& metadata_log, inode_t root_dir, cluster_id_t first_metadata_cluster_id,
cluster_range available_clusters, fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id) {
// Clear the metadata log
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 19666a8a..3304a02b 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -677,6 +677,7 @@ if (Seastar_EXPERIMENTAL_FS)
src/fs/metadata_log.hh
src/fs/metadata_log_bootstrap.cc
src/fs/metadata_log_bootstrap.hh
+ src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
src/fs/metadata_to_disk_buffer.hh
src/fs/path.hh
src/fs/range.hh
--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>


Apr 20, 2020, 8:02:33 AM4/20/20

to seastar-dev@googlegroups.com, Krzysztof Małysa, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

SeastarFS is a log-structured filesystem. Every shard will have 3
private logs:
- metadata log
- medium data log
- big data log (this is not actually a log, but in the big picture it
  looks like one)

Disk space is divided into clusters (typically several MiB each) that
all have the same size, which is a multiple of the alignment (typically
4096 bytes). Each shard has its private pool of clusters (the assignment
is stored in the bootstrap record). Each log consumes clusters one by
one: it writes to the current cluster and, when that cluster becomes
full, switches to a new one obtained from the pool of free clusters
managed by cluster_allocator. The metadata log and the medium data log
write data in the same manner: they fill up the cluster gradually from
left to right. The big data log takes a cluster and fills it completely
with data at once; it is only used during big writes.

This commit adds the skeleton of the metadata log:
- data structures for holding metadata in memory, with all operations on
  these data structures, i.e. manipulating files and their contents
- locking logic (a detailed description can be found in metadata_log.hh)
- buffers for writing logs to disk (one for metadata and one for medium
  data)
- basic higher-level interface, e.g. path lookup, iterating over a
  directory
- bootstrapping the metadata log, i.e. reading the metadata log from disk
  and reconstructing the shard's filesystem structure from just before
  shutdown

File content is stored as a set of data vectors, each of one of three
kinds: in-memory data, on-disk data, or hole. Small writes are written
directly to the metadata log, and because all metadata is kept in
memory, these writes are also kept in memory, hence the in-memory kind.
Medium and large data are not stored in memory, so they are represented
using the on-disk kind. Enlarging a file via truncate may produce holes,
hence the hole kind.
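
A minimal sketch of how the three kinds could arise, assuming a
simplified model: file_sketch, append_small, append_on_disk and enlarge
are hypothetical names for illustration, and std::vector replaces
temporary_buffer. The real structure is inode_data_vec in
src/fs/inode_info.hh.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <variant>
#include <vector>

struct in_mem { std::vector<uint8_t> bytes; };  // small write, kept in memory
struct on_disk { uint64_t device_offset; };     // medium or big write
struct hole {};                                 // produced by enlarging truncate

struct data_vec {
    uint64_t beg, end; // data spans [beg, end) of the file
    std::variant<in_mem, on_disk, hole> location;
};

// Hypothetical file model: file offset => data vector that begins there.
struct file_sketch {
    std::map<uint64_t, data_vec> data;

    uint64_t size() const {
        return data.empty() ? 0 : data.rbegin()->second.end;
    }

    void append_small(std::vector<uint8_t> bytes) {
        if (bytes.empty()) {
            return;
        }
        uint64_t beg = size();
        uint64_t end = beg + bytes.size();
        data.emplace(beg, data_vec{beg, end, in_mem{std::move(bytes)}});
    }

    void append_on_disk(uint64_t device_offset, uint64_t len) {
        if (len == 0) {
            return;
        }
        uint64_t beg = size();
        data.emplace(beg, data_vec{beg, beg + len, on_disk{device_offset}});
    }

    // Enlarging the file leaves a hole between the old and the new end.
    void enlarge(uint64_t new_size) {
        uint64_t old_size = size();
        if (new_size > old_size) {
            data.emplace(old_size, data_vec{old_size, new_size, hole{}});
        }
    }
};
```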

Directory entries are stored as metadata log entries -- directory inodes
have no content.

To-disk buffers hold data that will be written to disk. There are two
kinds: the (normal) to-disk buffer and the metadata to-disk buffer. The
latter is implemented using the former, but provides a higher-level
interface for appending metadata log entries rather than raw bytes.

The normal to-disk buffer appends data sequentially, but when a flush
occurs, the offset where the next data will be appended is aligned up to
the alignment to ensure that writes to the same cluster are
non-overlapping.
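
The effect of aligning after a flush can be sketched as below.
append_cursor is a hypothetical name and tracks offsets only; the
actual buffer lives in src/fs/to_disk_buffer.hh and may differ in
detail.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Rounds x up to a multiple of alignment (alignment must be a power of 2).
constexpr uint64_t align_up(uint64_t x, uint64_t alignment) {
    return (x + alignment - 1) & ~(alignment - 1);
}

// Hypothetical cursor tracking where a to-disk buffer appends next.
struct append_cursor {
    uint64_t alignment;
    uint64_t unflushed_beg = 0; // first byte not yet written to disk
    uint64_t pos = 0;           // where the next append lands

    void append(uint64_t len) { pos += len; }

    // Returns the aligned [beg, end) range a flush would write, then moves
    // the append position up to the next aligned offset.
    std::pair<uint64_t, uint64_t> flush() {
        std::pair<uint64_t, uint64_t> written{unflushed_beg, align_up(pos, alignment)};
        pos = written.second;
        unflushed_beg = pos;
        return written;
    }
};
```

Because every flush ends on an aligned offset and the next append
starts there, two consecutive flushes never write to the same aligned
block.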

The metadata to-disk buffer appends data using the normal to-disk buffer
but does some formatting along the way. The structure of the metadata
log on disk is as follows:
| checkpoint_1 | entry_1, entry_2, ..., entry_n | checkpoint_2 | ... |
               | <---- checkpointed data -----> |
etc. Every batch of metadata log entries is preceded by a checkpoint
entry. Appending a metadata log entry adds it to the current batch of
entries. A flush or a lack of space ends the current batch; the
checkpoint entry is then updated (because it holds the CRC code of all
checkpointed data), a write of the whole batch is requested, and a new
checkpoint is started (if there is space for it). The last checkpoint
in a cluster contains a special entry pointing to the next cluster
used by the metadata log.
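
Sealing and checking one batch could look roughly like this. It is a
sketch under assumptions: the header is written in host byte order
without the entry type tag, a hand-written bitwise CRC-32 stands in for
boost::crc_32_type, and crc32_sketch, batch_crc, seal_batch and
checked_length are hypothetical names. The CRC covers the data followed
by the length, as described in src/fs/metadata_disk_entries.hh.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Plain bitwise CRC-32 (reflected polynomial 0xEDB88320). Passing the
// previous result as `crc` continues the checksum over more bytes.
uint32_t crc32_sketch(const uint8_t* data, size_t size, uint32_t crc = 0) {
    crc = ~crc;
    for (size_t i = 0; i < size; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// CRC over the byte sequence | data | checkpointed_data_length |.
uint32_t batch_crc(const uint8_t* entries, uint32_t length) {
    uint32_t crc = crc32_sketch(entries, length);
    uint8_t length_bytes[sizeof(length)];
    std::memcpy(length_bytes, &length, sizeof(length));
    return crc32_sketch(length_bytes, sizeof(length_bytes), crc);
}

// Lays out | crc32 | length | entries... | for one batch of entries.
std::vector<uint8_t> seal_batch(const std::vector<uint8_t>& entries) {
    const uint32_t length = static_cast<uint32_t>(entries.size());
    const uint32_t crc = batch_crc(entries.data(), length);
    std::vector<uint8_t> batch(sizeof(crc) + sizeof(length));
    std::memcpy(batch.data(), &crc, sizeof(crc));
    std::memcpy(batch.data() + sizeof(crc), &length, sizeof(length));
    batch.insert(batch.end(), entries.begin(), entries.end());
    return batch;
}

// Returns the checkpointed data length if the checkpoint is valid.
std::optional<uint32_t> checked_length(const std::vector<uint8_t>& batch) {
    uint32_t crc = 0;
    uint32_t length = 0;
    const size_t header = sizeof(crc) + sizeof(length);
    if (batch.size() < header) {
        return std::nullopt;
    }
    std::memcpy(&crc, batch.data(), sizeof(crc));
    std::memcpy(&length, batch.data() + sizeof(crc), sizeof(length));
    if (length > batch.size() - header) {
        return std::nullopt; // length points past the available data
    }
    if (batch_crc(batch.data() + header, length) != crc) {
        return std::nullopt; // partially written or corrupted batch
    }
    return length;
}
```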

Bootstrapping is, in fact, just replaying all the actions from the
metadata log that were saved on disk. It works as follows:
- reads metadata log clusters one by one
- for each cluster, processes each checkpoint and the entries it
  covers, until it reaches the last checkpoint, which contains a
  pointer to the next cluster
- processing works as follows:
  - the checkpoint entry is read; if it is invalid, the metadata log
    ends here (the last checkpoint was partially written, or the
    metadata log really ended here, or there was some data
    corruption...) and we stop
  - if it is valid, it contains the length of the checkpointed data
    (metadata log entries), so we process all of them (an error there
    indicates data corruption with a CRC that still somehow matches, so
    we abort the whole bootstrapping with an error)
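
The replay loop above can be modelled as follows. This is a simplified
sketch, not the code in metadata_log_bootstrap.cc: checkpoint_sketch,
disk_sketch and replay are hypothetical names, the CRC check is reduced
to a flag, and entries are plain strings. The visited set is an
assumption about how a revisited cluster would be rejected.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

struct checkpoint_sketch {
    bool valid;                           // stands in for the CRC check
    std::vector<std::string> entries;     // checkpointed metadata log entries
    std::optional<uint64_t> next_cluster; // set only by the last checkpoint
};

// Hypothetical disk model: cluster id => checkpoints stored in that cluster.
using disk_sketch = std::map<uint64_t, std::vector<checkpoint_sketch>>;

// Returns the entries in the order in which bootstrap would replay them.
std::vector<std::string> replay(const disk_sketch& disk, uint64_t first_cluster) {
    std::vector<std::string> applied;
    std::set<uint64_t> visited;
    std::optional<uint64_t> curr = first_cluster;
    while (curr) {
        if (!visited.insert(*curr).second) {
            throw std::runtime_error("metadata log revisits a cluster");
        }
        auto it = disk.find(*curr);
        if (it == disk.end()) {
            throw std::runtime_error("metadata log points to a missing cluster");
        }
        std::optional<uint64_t> next;
        for (const checkpoint_sketch& cp : it->second) {
            if (!cp.valid) {
                return applied; // invalid checkpoint: the log ends here
            }
            applied.insert(applied.end(), cp.entries.begin(), cp.entries.end());
            if (cp.next_cluster) {
                next = cp.next_cluster; // the log continues in another cluster
                break;
            }
        }
        curr = next;
    }
    return applied;
}
```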

Locking ensures that concurrent modifications of the metadata do not
corrupt it. E.g. creating a file is a complex operation: you have to
create an inode, add a directory entry that will represent this inode
with a path, and write the corresponding metadata log entries to disk.
Simultaneous attempts to create the same file could corrupt the file
system, not to mention concurrent create and unlink on the same path.
Thus a careful and robust locking mechanism is used. For details see
metadata_log.hh.
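
The deadlock-free ordering used when two locks are needed can be
sketched as below. lock_key and locking_order are hypothetical names;
the sketch only shows the ordering rule (the lower (inode, dir_entry)
tuple goes first), not the locks themselves.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>

// (inode, optional dir entry name): no name means a lock on the inode itself.
using lock_key = std::pair<uint64_t, std::optional<std::string>>;

// Returns the two keys in the order in which their locks should be taken.
// The result does not depend on the order of the arguments, so every
// caller locks in the same direction and no cycle of waiters can form.
std::pair<lock_key, lock_key> locking_order(lock_key a, lock_key b) {
    if (a < b) {
        return {std::move(a), std::move(b)};
    }
    return {std::move(b), std::move(a)};
}
```

For file creation, the shared lock on directory inode 1 and the unique
lock on the entry (1, "file") are always taken in that order, because
an empty dir entry compares lower than any name.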

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---

include/seastar/fs/exceptions.hh | 88 +++++++++
src/fs/inode_info.hh | 221 ++++++++++++++++++++++
src/fs/metadata_disk_entries.hh | 63 +++++++
src/fs/metadata_log.hh | 295 ++++++++++++++++++++++++++++++
src/fs/metadata_log_bootstrap.hh | 123 +++++++++++++
src/fs/metadata_to_disk_buffer.hh | 158 ++++++++++++++++
src/fs/to_disk_buffer.hh | 138 ++++++++++++++
src/fs/unix_metadata.hh | 40 ++++
src/fs/metadata_log.cc | 222 ++++++++++++++++++++++
src/fs/metadata_log_bootstrap.cc | 264 ++++++++++++++++++++++++++
CMakeLists.txt | 10 +
11 files changed, 1622 insertions(+)
create mode 100644 include/seastar/fs/exceptions.hh
create mode 100644 src/fs/inode_info.hh
create mode 100644 src/fs/metadata_disk_entries.hh
create mode 100644 src/fs/metadata_log.hh
create mode 100644 src/fs/metadata_log_bootstrap.hh
create mode 100644 src/fs/metadata_to_disk_buffer.hh
create mode 100644 src/fs/to_disk_buffer.hh
create mode 100644 src/fs/unix_metadata.hh
create mode 100644 src/fs/metadata_log.cc
create mode 100644 src/fs/metadata_log_bootstrap.cc

diff --git a/include/seastar/fs/exceptions.hh b/include/seastar/fs/exceptions.hh
new file mode 100644
index 00000000..9941f557
--- /dev/null
+++ b/include/seastar/fs/exceptions.hh
@@ -0,0 +1,88 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+

+#include <exception>

+
+namespace seastar::fs {
+

+struct fs_exception : public std::exception {
+ const char* what() const noexcept override = 0;
+};
+
+struct cluster_size_too_small_to_perform_operation_exception : public std::exception {
+ const char* what() const noexcept override { return "Cluster size is too small to perform operation"; }
+};
+
+struct invalid_inode_exception : public fs_exception {
+ const char* what() const noexcept override { return "Invalid inode"; }
+};
+
+struct invalid_argument_exception : public fs_exception {
+ const char* what() const noexcept override { return "Invalid argument"; }
+};
+
+struct operation_became_invalid_exception : public fs_exception {
+ const char* what() const noexcept override { return "Operation became invalid"; }
+};
+
+struct no_more_space_exception : public fs_exception {
+ const char* what() const noexcept override { return "No more space on device"; }
+};
+
+struct file_already_exists_exception : public fs_exception {
+ const char* what() const noexcept override { return "File already exists"; }
+};
+
+struct filename_too_long_exception : public fs_exception {
+ const char* what() const noexcept override { return "Filename too long"; }
+};
+
+struct is_directory_exception : public fs_exception {
+ const char* what() const noexcept override { return "Is a directory"; }
+};
+
+struct directory_not_empty_exception : public fs_exception {
+ const char* what() const noexcept override { return "Directory is not empty"; }
+};
+
+struct path_lookup_exception : public fs_exception {
+ const char* what() const noexcept override = 0;
+};
+
+struct path_is_not_absolute_exception : public path_lookup_exception {
+ const char* what() const noexcept override { return "Path is not absolute"; }
+};
+
+struct invalid_path_exception : public path_lookup_exception {
+ const char* what() const noexcept override { return "Path is invalid"; }
+};
+
+struct no_such_file_or_directory_exception : public path_lookup_exception {
+ const char* what() const noexcept override { return "No such file or directory"; }
+};
+
+struct path_component_not_directory_exception : public path_lookup_exception {
+ const char* what() const noexcept override { return "A component used as a directory is not a directory"; }

+};
+
+} // namespace seastar::fs

diff --git a/src/fs/inode_info.hh b/src/fs/inode_info.hh
new file mode 100644
index 00000000..89bc71d8
--- /dev/null
+++ b/src/fs/inode_info.hh
@@ -0,0 +1,221 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+

+#include "fs/inode.hh"
+#include "fs/units.hh"
+#include "fs/unix_metadata.hh"
+#include "seastar/core/temporary_buffer.hh"
+#include "seastar/fs/overloaded.hh"
+
+#include <map>
+#include <variant>

+
+namespace seastar::fs {
+

+struct inode_data_vec {
+ file_range data_range; // data spans [beg, end) range of the file
+
+ struct in_mem_data {

+ temporary_buffer<uint8_t> data;
+ };
+

+ struct on_disk_data {
+ file_offset_t device_offset;
+ };
+
+ struct hole_data { };
+
+ std::variant<in_mem_data, on_disk_data, hole_data> data_location;
+
+ // TODO: rename that function to something more suitable
+ inode_data_vec share_copy() {
+ inode_data_vec shared;
+ shared.data_range = data_range;
+ std::visit(overloaded {
+ [&](inode_data_vec::in_mem_data& mem) {
+ shared.data_location = inode_data_vec::in_mem_data {mem.data.share()};
+ },
+ [&](inode_data_vec::on_disk_data& disk_data) {
+ shared.data_location = disk_data;
+ },
+ [&](inode_data_vec::hole_data&) {
+ shared.data_location = inode_data_vec::hole_data {};
+ },
+ }, data_location);
+ return shared;
+ }
+};
+
+struct inode_info {
+ uint32_t opened_files_count = 0; // Number of open files referencing inode
+ uint32_t directories_containing_file = 0;
+ unix_metadata metadata;
+
+ struct directory {
+ // TODO: directory entry cannot contain '/' character --> add checks for that
+ std::map<std::string, inode_t, std::less<>> entries; // entry name => inode
+ };
+
+ struct file {
+ std::map<file_offset_t, inode_data_vec> data; // file offset => data vector that begins there (data vectors
+ // do not overlap)
+
+ file_offset_t size() const noexcept {
+ return (data.empty() ? 0 : data.rbegin()->second.data_range.end);
+ }
+
+ // Deletes data vectors that are subsets of @p range and cuts overlapping data vectors so that they
+ // no longer overlap it. @p cut_data_vec_processor is called on each inode_data_vec (including parts of
+ // overlapped data vectors) that will be deleted
+ template<class Func>
+ void cut_out_data_range(file_range range, Func&& cut_data_vec_processor) {
+ static_assert(std::is_invocable_v<Func, inode_data_vec>);
+ // Cut all vectors intersecting with range
+ auto it = data.lower_bound(range.beg);
+ if (it != data.begin() and are_intersecting(range, prev(it)->second.data_range)) {
+ --it;
+ }
+
+ while (it != data.end() and are_intersecting(range, it->second.data_range)) {
+ auto data_vec = std::move(data.extract(it++).mapped());
+ const auto cap = intersection(range, data_vec.data_range);
+ if (cap == data_vec.data_range) {
+ // Fully intersects => remove it
+ cut_data_vec_processor(std::move(data_vec));
+ continue;
+ }
+
+ // Overlaps => cut it, possibly into two parts:
+ // | data_vec |
+ // | cap |
+ // | left | mid | right |
+ // left and right remain, but mid is deleted
+ inode_data_vec left, mid, right;
+ left.data_range = {data_vec.data_range.beg, cap.beg};
+ mid.data_range = cap;
+ right.data_range = {cap.end, data_vec.data_range.end};
+ auto right_beg_shift = right.data_range.beg - data_vec.data_range.beg;
+ auto mid_beg_shift = mid.data_range.beg - data_vec.data_range.beg;
+ std::visit(overloaded {
+ [&](inode_data_vec::in_mem_data& mem) {
+ left.data_location = inode_data_vec::in_mem_data {mem.data.share(0, left.data_range.size())};
+ mid.data_location = inode_data_vec::in_mem_data {
+ mem.data.share(mid_beg_shift, mid.data_range.size())
+ };
+ right.data_location = inode_data_vec::in_mem_data {
+ mem.data.share(right_beg_shift, right.data_range.size())
+ };
+ },
+ [&](inode_data_vec::on_disk_data& disk_data) {
+ left.data_location = disk_data;
+ mid.data_location = inode_data_vec::on_disk_data {disk_data.device_offset + mid_beg_shift};
+ right.data_location = inode_data_vec::on_disk_data {disk_data.device_offset + right_beg_shift};
+ },
+ [&](inode_data_vec::hole_data&) {
+ left.data_location = inode_data_vec::hole_data {};
+ mid.data_location = inode_data_vec::hole_data {};
+ right.data_location = inode_data_vec::hole_data {};
+ },
+ }, data_vec.data_location);
+
+ // Save new data vectors
+ if (not left.data_range.is_empty()) {
+ data.emplace(left.data_range.beg, std::move(left));
+ }
+ if (not right.data_range.is_empty()) {
+ data.emplace(right.data_range.beg, std::move(right));
+ }
+
+ // Process deleted vector
+ cut_data_vec_processor(std::move(mid));
+ }
+ }
+
+ // Executes @p execute_on_data_range_processor on each data vector that is a subset of @p range.
+ // Data vectors on the edges are appropriately trimmed before being passed to the function.
+ template<class Func>
+ void execute_on_data_range(file_range range, Func&& execute_on_data_range_processor) {
+ static_assert(std::is_invocable_v<Func, inode_data_vec>);
+ auto it = data.lower_bound(range.beg);
+ if (it != data.begin() and are_intersecting(range, prev(it)->second.data_range)) {
+ --it;
+ }
+
+ while (it != data.end() and are_intersecting(range, it->second.data_range)) {
+ auto& data_vec = (it++)->second;
+ const auto cap = intersection(range, data_vec.data_range);
+ if (cap == data_vec.data_range) {
+ // Fully intersects => execute
+ execute_on_data_range_processor(data_vec.share_copy());
+ continue;
+ }
+
+ inode_data_vec mid;
+ mid.data_range = std::move(cap);
+ auto mid_beg_shift = mid.data_range.beg - data_vec.data_range.beg;
+ std::visit(overloaded {
+ [&](inode_data_vec::in_mem_data& mem) {
+ mid.data_location = inode_data_vec::in_mem_data {
+ mem.data.share(mid_beg_shift, mid.data_range.size())
+ };
+ },
+ [&](inode_data_vec::on_disk_data& disk_data) {
+ mid.data_location = inode_data_vec::on_disk_data {disk_data.device_offset + mid_beg_shift};
+ },
+ [&](inode_data_vec::hole_data&) {
+ mid.data_location = inode_data_vec::hole_data {};
+ },
+ }, data_vec.data_location);
+
+ // Execute on middle range
+ execute_on_data_range_processor(std::move(mid));
+ }
+ }
+ };
+
+ std::variant<directory, file> contents;
+
+ bool is_linked() const noexcept {
+ return directories_containing_file != 0;
+ }
+
+ bool is_open() const noexcept {
+ return opened_files_count != 0;
+ }
+
+ constexpr bool is_directory() const noexcept { return std::holds_alternative<directory>(contents); }
+
+ // These are noexcept because invalid access is a bug not an error
+ constexpr directory& get_directory() & noexcept { return std::get<directory>(contents); }
+ constexpr const directory& get_directory() const & noexcept { return std::get<directory>(contents); }
+ constexpr directory&& get_directory() && noexcept { return std::move(std::get<directory>(contents)); }
+
+ constexpr bool is_file() const noexcept { return std::holds_alternative<file>(contents); }
+
+ // These are noexcept because invalid access is a bug not an error
+ constexpr file& get_file() & noexcept { return std::get<file>(contents); }
+ constexpr const file& get_file() const & noexcept { return std::get<file>(contents); }
+ constexpr file&& get_file() && noexcept { return std::move(std::get<file>(contents)); }

+};
+
+} // namespace seastar::fs

diff --git a/src/fs/metadata_disk_entries.hh b/src/fs/metadata_disk_entries.hh
new file mode 100644
index 00000000..44c2a1c7
--- /dev/null
+++ b/src/fs/metadata_disk_entries.hh
@@ -0,0 +1,63 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+

+#include "fs/cluster.hh"
+#include "fs/inode.hh"
+#include "fs/unix_metadata.hh"

+
+namespace seastar::fs {
+

+enum ondisk_type : uint8_t {
+ INVALID = 0,
+ CHECKPOINT,
+ NEXT_METADATA_CLUSTER,
+};
+
+struct ondisk_checkpoint {
+ // The disk format is as follows:
+ // | ondisk_checkpoint | .............................. |
+ // | data |
+ // |<-- checkpointed_data_length -->|
+ // ^
+ // ______________________________________________/
+ // /
+ // there ends checkpointed data and (next checkpoint begins or metadata in the current cluster end)
+ //
+ // CRC is calculated from byte sequence | data | checkpointed_data_length |
+ // E.g. if the data consist of bytes "abcd" and checkpointed_data_length of bytes "xyz" then the byte sequence
+ // would be "abcdxyz"
+ uint32_t crc32_code;
+ unit_size_t checkpointed_data_length;
+} __attribute__((packed));
+
+struct ondisk_next_metadata_cluster {
+ cluster_id_t cluster_id; // metadata log continues there
+} __attribute__((packed));
+
+template<typename T>
+constexpr size_t ondisk_entry_size(const T& entry) noexcept {
+ static_assert(std::is_same_v<T, ondisk_next_metadata_cluster>, "ondisk entry size not defined for given type");
+ return sizeof(ondisk_type) + sizeof(entry);

+}
+
+} // namespace seastar::fs

diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
new file mode 100644
index 00000000..c10852a3
--- /dev/null
+++ b/src/fs/metadata_log.hh
@@ -0,0 +1,295 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+

+#include "fs/cluster.hh"
+#include "fs/cluster_allocator.hh"
+#include "fs/inode.hh"
+#include "fs/inode_info.hh"
+#include "fs/metadata_disk_entries.hh"
+#include "fs/metadata_to_disk_buffer.hh"
+#include "fs/units.hh"
+#include "fs/unix_metadata.hh"
+#include "fs/value_shared_lock.hh"
+#include "seastar/core/file-types.hh"
+#include "seastar/core/future-util.hh"
+#include "seastar/core/future.hh"
+#include "seastar/core/shared_future.hh"
+#include "seastar/core/shared_ptr.hh"
+#include "seastar/core/temporary_buffer.hh"
+#include "seastar/fs/exceptions.hh"
+
+#include <chrono>
+#include <cstddef>
+#include <exception>
+#include <type_traits>
+#include <utility>
+#include <variant>

+
+namespace seastar::fs {
+

+class metadata_log {
+ block_device _device;
+ const unit_size_t _cluster_size;
+ const unit_size_t _alignment;
+
+ // Takes care of writing current cluster of serialized metadata log entries to device
+ shared_ptr<metadata_to_disk_buffer> _curr_cluster_buff;
+ shared_future<> _background_futures = now();
+
+ // In memory metadata
+ cluster_allocator _cluster_allocator;
+ std::map<inode_t, inode_info> _inodes;
+ inode_t _root_dir;
+ shard_inode_allocator _inode_allocator;
+
+ // Locks are used to ensure metadata consistency while allowing concurrent usage.
+ //
+ // Whenever one wants to create or delete inode or directory entry, one has to acquire appropriate unique lock for
+ // the inode / dir entry that will appear / disappear and only after locking that operation should take place.
+ // Shared locks should be used only to ensure that an inode / dir entry won't disappear / appear, while some action
+ // is performed. Therefore, unique locks ensure that resource is not used by anyone else.
+ //
+ // IMPORTANT: if an operation needs to acquire more than one lock, it has to be done with *one* call to
+ // locks::with_locks() because it is ensured there that a deadlock-free locking order is used (for details see
+ // that function).
+ //
+ // Examples:
+ // - To create file we have to take shared lock (SL) on the directory to which we add a dir entry and
+ // unique lock (UL) on the added entry in this directory. SL is taken because the directory should not disappear.
+ // UL is taken, because we do not want the entry to appear while we are creating it.
+ // - To read or write to a file, a SL is acquired on its inode and then the operation is performed.
+ class locks {
+ value_shared_lock<inode_t> _inode_locks;
+ value_shared_lock<std::pair<inode_t, std::string>> _dir_entry_locks;
+
+ public:
+ struct shared {
+ inode_t inode;
+ std::optional<std::string> dir_entry;
+ };
+
+ template<class T>
+ static constexpr bool is_shared = std::is_same_v<std::remove_cv_t<std::remove_reference_t<T>>, shared>;
+
+ struct unique {
+ inode_t inode;
+ std::optional<std::string> dir_entry;
+ };
+
+ template<class T>
+ static constexpr bool is_unique = std::is_same_v<std::remove_cv_t<std::remove_reference_t<T>>, unique>;
+
+ template<class Kind, class Func>
+ auto with_lock(Kind kind, Func&& func) {
+ static_assert(is_shared<Kind> or is_unique<Kind>);
+ if constexpr (is_shared<Kind>) {
+ if (kind.dir_entry.has_value()) {
+ return _dir_entry_locks.with_shared_on({kind.inode, std::move(*kind.dir_entry)},
+ std::forward<Func>(func));
+ } else {
+ return _inode_locks.with_shared_on(kind.inode, std::forward<Func>(func));
+ }
+ } else {
+ if (kind.dir_entry.has_value()) {
+ return _dir_entry_locks.with_lock_on({kind.inode, std::move(*kind.dir_entry)},
+ std::forward<Func>(func));
+ } else {
+ return _inode_locks.with_lock_on(kind.inode, std::forward<Func>(func));
+ }
+ }
+ }
+
+ private:
+ template<class Kind1, class Kind2, class Func>
+ auto with_locks_in_order(Kind1 kind1, Kind2 kind2, Func func) {
+ // Func is not a universal reference because we will have to store it
+ return with_lock(std::move(kind1), [this, kind2 = std::move(kind2), func = std::move(func)] () mutable {
+ return with_lock(std::move(kind2), std::move(func));
+ });
+ };
+
+ public:
+
+ template<class Kind1, class Kind2, class Func>
+ auto with_locks(Kind1 kind1, Kind2 kind2, Func&& func) {
+ static_assert(is_shared<Kind1> or is_unique<Kind1>);
+ static_assert(is_shared<Kind2> or is_unique<Kind2>);
+
+ // Locking order is as follows: kind with lower tuple (inode, dir_entry) goes first.
+ // This order is linear and we always lock in one direction, so the graph of locking relations (A -> B iff
+ // lock on A is acquired and lock on B is acquired / being acquired) makes a DAG. Thus, deadlock is
+ // impossible, as it would require a cycle to appear.
+ std::pair<inode_t, std::optional<std::string>&> k1 {kind1.inode, kind1.dir_entry};
+ std::pair<inode_t, std::optional<std::string>&> k2 {kind2.inode, kind2.dir_entry};
+ if (k1 < k2) {
+ return with_locks_in_order(std::move(kind1), std::move(kind2), std::forward<Func>(func));
+ } else {
+ return with_locks_in_order(std::move(kind2), std::move(kind1), std::forward<Func>(func));
+ }
+ }
+ } _locks;
+
+ // TODO: for compaction: keep some set(?) of inode_data_vec, so that we can keep track of clusters that have lowest
+ // utilization (up-to-date data)
+ // TODO: for compaction: keep estimated metadata log size (that would take when written to disk) and
+ // the real size of metadata log taken on disk to allow for detecting when compaction
+
+ friend class metadata_log_bootstrap;
+
+public:
+ metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment,
+ shared_ptr<metadata_to_disk_buffer> cluster_buff);
+
+ metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment);
+
+ metadata_log(const metadata_log&) = delete;
+ metadata_log& operator=(const metadata_log&) = delete;
+ metadata_log(metadata_log&&) = default;
+
+ future<> bootstrap(inode_t root_dir, cluster_id_t first_metadata_cluster_id, cluster_range available_clusters,
+ fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id);
+
+ future<> shutdown();
+
+private:
+ bool inode_exists(inode_t inode) const noexcept {
+ return _inodes.count(inode) != 0;
+ }

+
+ template<class Func>

+ void schedule_background_task(Func&& task) {
+ _background_futures = when_all_succeed(_background_futures.get_future(), std::forward<Func>(task));
+ }
+
+ void schedule_flush_of_curr_cluster();
+
+ enum class flush_result {
+ DONE,
+ NO_SPACE
+ };
+
+ [[nodiscard]] flush_result schedule_flush_of_curr_cluster_and_change_it_to_new_one();
+
+ future<> flush_curr_cluster();
+
+ enum class append_result {
+ APPENDED,
+ TOO_BIG,
+ NO_SPACE
+ };
+
+ template<class... Args>
+ [[nodiscard]] append_result append_ondisk_entry(Args&&... args) {
+ using AR = append_result;
+ // TODO: maybe check for errors on _background_futures to expose previous errors?
+ switch (_curr_cluster_buff->append(args...)) {
+ case metadata_to_disk_buffer::APPENDED:
+ return AR::APPENDED;
+ case metadata_to_disk_buffer::TOO_BIG:
+ break;
+ }
+
+ switch (schedule_flush_of_curr_cluster_and_change_it_to_new_one()) {
+ case flush_result::NO_SPACE:
+ return AR::NO_SPACE;
+ case flush_result::DONE:
+ break;
+ }
+
+ switch (_curr_cluster_buff->append(args...)) {
+ case metadata_to_disk_buffer::APPENDED:
+ return AR::APPENDED;
+ case metadata_to_disk_buffer::TOO_BIG:
+ return AR::TOO_BIG;
+ }

+
+ __builtin_unreachable();
+ }
+

+ enum class path_lookup_error {
+ NOT_ABSOLUTE, // a path is not absolute
+ NO_ENTRY, // no such file or directory
+ NOT_DIR, // a component used as a directory in path is not, in fact, a directory
+ };
+
+ std::variant<inode_t, path_lookup_error> do_path_lookup(const std::string& path) const noexcept;
+
+ // It is safe for @p path to be a temporary (there is no need to worry about its lifetime)
+ future<inode_t> path_lookup(const std::string& path) const;
+
+public:
+ template<class Func>
+ future<> iterate_directory(const std::string& dir_path, Func func) {
+ static_assert(std::is_invocable_r_v<future<>, Func, const std::string&> or
+ std::is_invocable_r_v<future<stop_iteration>, Func, const std::string&>);
+ auto convert_func = [&]() -> decltype(auto) {
+ if constexpr (std::is_invocable_r_v<future<stop_iteration>, Func, const std::string&>) {
+ return std::move(func);
+ } else {
+ return [func = std::move(func)](const std::string& entry) -> future<stop_iteration> {
+ return func(entry).then([] {
+ return stop_iteration::no;
+ });
+ };
+ }
+ };

+ return path_lookup(dir_path).then([this, func = convert_func()](inode_t dir_inode) {
+ return do_with(std::move(func), std::string {}, [this, dir_inode](auto& func, auto& prev_entry) {
+ auto it = _inodes.find(dir_inode);
+ if (it == _inodes.end()) {
+ return now(); // Directory disappeared
+ }
+ if (not it->second.is_directory()) {
+ return make_exception_future(path_component_not_directory_exception());
+ }
+
+ return repeat([this, dir_inode, &prev_entry, &func] {
+ auto it = _inodes.find(dir_inode);
+ if (it == _inodes.end()) {
+ return make_ready_future<stop_iteration>(stop_iteration::yes); // Directory disappeared
+ }
+ assert(it->second.is_directory() and "Directory cannot become a file");
+ auto& dir = it->second.get_directory();
+
+ auto entry_it = dir.entries.upper_bound(prev_entry);
+ if (entry_it == dir.entries.end()) {
+ return make_ready_future<stop_iteration>(stop_iteration::yes); // No more entries
+ }
+
+ prev_entry = entry_it->first;
+ return func(static_cast<const std::string&>(prev_entry));
+ });
+ });
+ });
+ }
+
+ // Returns size of the file or throws exception iff @p inode is invalid
+ file_offset_t file_size(inode_t inode) const;
+

+ // All disk-related errors will be exposed here

+ future<> flush_log() {
+ return flush_curr_cluster();

+ }
+};
+
+} // namespace seastar::fs

diff --git a/src/fs/metadata_log_bootstrap.hh b/src/fs/metadata_log_bootstrap.hh
new file mode 100644
index 00000000..5da79631
--- /dev/null
+++ b/src/fs/metadata_log_bootstrap.hh
@@ -0,0 +1,123 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+

+#include "fs/bitwise.hh"
+#include "fs/cluster.hh"
+#include "fs/inode.hh"
+#include "fs/inode_info.hh"
+#include "fs/metadata_disk_entries.hh"
+#include "fs/metadata_to_disk_buffer.hh"
+#include "fs/units.hh"
+#include "fs/metadata_log.hh"
+#include "seastar/core/do_with.hh"
+#include "seastar/core/future-util.hh"
+#include "seastar/core/future.hh"
+#include "seastar/core/temporary_buffer.hh"
+
+#include <boost/crc.hpp>
+#include <cstddef>
+#include <cstring>
+#include <unordered_set>
+#include <variant>

+
+namespace seastar::fs {
+

+// TODO: add a comment about what it is
+class data_reader {
+ const uint8_t* _data = nullptr;
+ size_t _size = 0;
+ size_t _pos = 0;
+ size_t _last_checkpointed_pos = 0;
+
+public:
+ data_reader() = default;
+
+ data_reader(const uint8_t* data, size_t size) : _data(data), _size(size) {}
+
+ size_t curr_pos() const noexcept { return _pos; }
+
+ size_t last_checkpointed_pos() const noexcept { return _last_checkpointed_pos; }
+
+ size_t bytes_left() const noexcept { return _size - _pos; }
+
+ void align_curr_pos(size_t alignment) noexcept { _pos = round_up_to_multiple_of_power_of_2(_pos, alignment); }
+
+ void checkpoint_curr_pos() noexcept { _last_checkpointed_pos = _pos; }
+
+ // Returns whether the reading was successful
+ bool read(void* destination, size_t size);
+
+ // Returns whether the reading was successful
+ template<class T>
+ bool read_entry(T& entry) noexcept {
+ return read(&entry, sizeof(entry));
+ }
+
+ // Returns whether the reading was successful
+ bool read_string(std::string& str, size_t size);
+
+ std::optional<temporary_buffer<uint8_t>> read_tmp_buff(size_t size);
+
+ // Returns whether the processing was successful
+ bool process_crc_without_reading(boost::crc_32_type& crc, size_t size);
+
+ std::optional<data_reader> extract(size_t size);
+};
+
+class metadata_log_bootstrap {
+ metadata_log& _metadata_log;
+ cluster_range _available_clusters;
+ std::unordered_set<cluster_id_t> _taken_clusters;
+ std::optional<cluster_id_t> _next_cluster;
+ temporary_buffer<uint8_t> _curr_cluster_data;
+ data_reader _curr_cluster;
+ data_reader _curr_checkpoint;
+
+ metadata_log_bootstrap(metadata_log& metadata_log, cluster_range available_clusters);
+
+ future<> bootstrap(cluster_id_t first_metadata_cluster_id, fs_shard_id_t fs_shards_pool_size,
+ fs_shard_id_t fs_shard_id);
+
+ future<> bootstrap_cluster(cluster_id_t curr_cluster);
+
+ static auto invalid_entry_exception() {
+ return make_exception_future<>(std::runtime_error("Invalid metadata log entry"));
+ }
+
+ future<> bootstrap_read_cluster();
+
+ // Returns whether reading and checking was successful
+ bool read_and_check_checkpoint();
+
+ future<> bootstrap_checkpointed_data();
+
+ future<> bootstrap_next_metadata_cluster();
+
+ bool inode_exists(inode_t inode);
+
+public:
+ static future<> bootstrap(metadata_log& metadata_log, inode_t root_dir, cluster_id_t first_metadata_cluster_id,
+ cluster_range available_clusters, fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id);

+};
+
+} // namespace seastar::fs
diff --git a/src/fs/metadata_to_disk_buffer.hh b/src/fs/metadata_to_disk_buffer.hh

new file mode 100644
index 00000000..bd60f4f3
--- /dev/null
+++ b/src/fs/metadata_to_disk_buffer.hh
@@ -0,0 +1,158 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*

+ */
+
+#pragma once
+

+#include "fs/bitwise.hh"
+#include "fs/metadata_disk_entries.hh"
+#include "fs/to_disk_buffer.hh"
+
+#include <boost/crc.hpp>

+
+namespace seastar::fs {
+

+// Represents a buffer that will be written to a block_device. Method init() should be called right after the
+// constructor in order to finish construction.
+class metadata_to_disk_buffer : protected to_disk_buffer {
+ boost::crc_32_type _crc;
+
+public:
+ metadata_to_disk_buffer() = default;
+
+ using to_disk_buffer::init; // Re-exported explicitly: behavior stays the same
+
+ virtual shared_ptr<metadata_to_disk_buffer> virtual_constructor() const {
+ return make_shared<metadata_to_disk_buffer>();
+ }
+
+ /**
+ * @brief Inits object, leaving it in state as if just after flushing with unflushed data end at
+ * @p cluster_beg_offset
+ *
+ * @param aligned_max_size size of the buffer, must be aligned
+ * @param alignment write alignment
+ * @param cluster_beg_offset disk offset of the beginning of the cluster
+ * @param metadata_end_pos position at which valid metadata ends: valid metadata range: [0, @p metadata_end_pos)
+ */
+ virtual void init_from_bootstrapped_cluster(size_t aligned_max_size, unit_size_t alignment,
+ disk_offset_t cluster_beg_offset, size_t metadata_end_pos) {
+ assert(is_power_of_2(alignment));
+ assert(mod_by_power_of_2(aligned_max_size, alignment) == 0);
+ assert(mod_by_power_of_2(cluster_beg_offset, alignment) == 0);
+ assert(aligned_max_size >= sizeof(ondisk_type) + sizeof(ondisk_checkpoint));
+ assert(alignment >= sizeof(ondisk_type) + sizeof(ondisk_checkpoint) + sizeof(ondisk_type) +
+ sizeof(ondisk_next_metadata_cluster) and
+ "We always need to be able to pack at least a checkpoint and next_metadata_cluster entry to the last "
+ "data flush in the cluster");
+ assert(metadata_end_pos < aligned_max_size);
+
+ _max_size = aligned_max_size;
+ _alignment = alignment;
+ _cluster_beg_offset = cluster_beg_offset;
+ auto aligned_pos = round_up_to_multiple_of_power_of_2(metadata_end_pos, _alignment);
+ _unflushed_data = {aligned_pos, aligned_pos};
+ _buff = decltype(_buff)::aligned(_alignment, _max_size);
+
+ start_new_unflushed_data();
+ }
+
+protected:
+ void start_new_unflushed_data() noexcept override {
+ if (bytes_left() < sizeof(ondisk_type) + sizeof(ondisk_checkpoint) + sizeof(ondisk_type) +
+ sizeof(ondisk_next_metadata_cluster)) {
+ assert(bytes_left() == 0); // alignment has to be big enough to hold checkpoint and next_metadata_cluster
+ return; // No more space
+ }
+
+ ondisk_type type = INVALID;
+ ondisk_checkpoint checkpoint;
+ std::memset(&checkpoint, 0, sizeof(checkpoint));
+
+ to_disk_buffer::append_bytes(&type, sizeof(type));
+ to_disk_buffer::append_bytes(&checkpoint, sizeof(checkpoint));
+
+ _crc.reset();
+ }
+
+ void prepare_unflushed_data_for_flush() noexcept override {
+ // Make checkpoint valid
+ constexpr ondisk_type checkpoint_type = CHECKPOINT;
+ size_t checkpoint_pos = _unflushed_data.beg + sizeof(checkpoint_type);
+ ondisk_checkpoint checkpoint;
+ checkpoint.checkpointed_data_length = _unflushed_data.end - checkpoint_pos - sizeof(checkpoint);
+ _crc.process_bytes(&checkpoint.checkpointed_data_length, sizeof(checkpoint.checkpointed_data_length));
+ checkpoint.crc32_code = _crc.checksum();
+
+ std::memcpy(_buff.get_write() + _unflushed_data.beg, &checkpoint_type, sizeof(checkpoint_type));
+ std::memcpy(_buff.get_write() + checkpoint_pos, &checkpoint, sizeof(checkpoint));
+ }
+
+public:
+ using to_disk_buffer::bytes_left_after_flush_if_done_now; // Re-exported explicitly: behavior stays the same
+
+private:
+ void append_bytes(const void* data, size_t len) noexcept override {
+ to_disk_buffer::append_bytes(data, len);
+ _crc.process_bytes(data, len);
+ }
+
+public:
+ enum append_result {
+ APPENDED,
+ TOO_BIG,
+ };
+
+ [[nodiscard]] virtual append_result append(const ondisk_next_metadata_cluster& next_metadata_cluster) noexcept {
+ ondisk_type type = NEXT_METADATA_CLUSTER;
+ if (bytes_left() < ondisk_entry_size(next_metadata_cluster)) {
+ return TOO_BIG;
+ }
+
+ append_bytes(&type, sizeof(type));
+ append_bytes(&next_metadata_cluster, sizeof(next_metadata_cluster));
+ return APPENDED;
+ }
+
+ using to_disk_buffer::bytes_left;
+
+protected:
+ bool fits_for_append(size_t bytes_no) const noexcept {
+ // We need to reserve space for the next metadata cluster entry
+ return (bytes_left() >= bytes_no + sizeof(ondisk_type) + sizeof(ondisk_next_metadata_cluster));
+ }
+
+private:
+ template<class T>
+ [[nodiscard]] append_result append_simple(ondisk_type type, const T& entry) noexcept {
+ if (not fits_for_append(ondisk_entry_size(entry))) {
+ return TOO_BIG;
+ }
+
+ append_bytes(&type, sizeof(type));
+ append_bytes(&entry, sizeof(entry));
+ return APPENDED;
+ }
+
+public:
+ using to_disk_buffer::flush_to_disk;

+};
+
+} // namespace seastar::fs
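The checkpoint handling in metadata_to_disk_buffer can be hard to follow from the diff alone, so here is a minimal standalone sketch of the same idea: reserve a placeholder header, append data while feeding a CRC, then patch the header at flush time. It makes several assumptions: `checkpoint_header`, `make_fragment`, `verify_fragment` and the `TYPE_*` constants are hypothetical names, the header layout is simplified, and a small bitwise CRC-32 stands in for boost::crc_32_type so that the example has no dependencies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Bitwise CRC-32 (reflected polynomial 0xEDB88320, initial value and final XOR
// of 0xFFFFFFFF), i.e. the standard parameters that boost::crc_32_type uses.
class crc32 {
    std::uint32_t _state = 0xFFFFFFFFu;

public:
    void process_bytes(const void* data, std::size_t len) noexcept {
        auto* p = static_cast<const std::uint8_t*>(data);
        for (std::size_t i = 0; i < len; ++i) {
            _state ^= p[i];
            for (int bit = 0; bit < 8; ++bit) {
                _state = (_state >> 1) ^ (0xEDB88320u & (0u - (_state & 1u)));
            }
        }
    }
    std::uint32_t checksum() const noexcept { return _state ^ 0xFFFFFFFFu; }
};

// Hypothetical, simplified header; the on-disk layout in the patch may differ.
struct checkpoint_header {
    std::uint32_t crc32_code;
    std::uint32_t data_length;
};

constexpr std::uint8_t TYPE_INVALID = 0;    // assumed value
constexpr std::uint8_t TYPE_CHECKPOINT = 1; // assumed value

// Builds one checkpointed fragment laid out as | type | header | payload |.
std::vector<std::uint8_t> make_fragment(const std::vector<std::uint8_t>& payload) {
    // Placeholder written first: INVALID type followed by a zeroed header.
    std::vector<std::uint8_t> buf(1 + sizeof(checkpoint_header), TYPE_INVALID);
    crc32 crc;
    buf.insert(buf.end(), payload.begin(), payload.end());
    crc.process_bytes(payload.data(), payload.size());

    // At flush time: fold the length into the CRC, then patch the placeholder.
    checkpoint_header header{};
    header.data_length = static_cast<std::uint32_t>(payload.size());
    crc.process_bytes(&header.data_length, sizeof(header.data_length));
    header.crc32_code = crc.checksum();
    buf[0] = TYPE_CHECKPOINT;
    std::memcpy(buf.data() + 1, &header, sizeof(header));
    return buf;
}

// Reader side: returns true only if the type, length and CRC are all consistent.
bool verify_fragment(const std::vector<std::uint8_t>& buf) {
    if (buf.size() < 1 + sizeof(checkpoint_header) || buf[0] != TYPE_CHECKPOINT) {
        return false;
    }
    checkpoint_header header;
    std::memcpy(&header, buf.data() + 1, sizeof(header));
    const std::size_t payload_pos = 1 + sizeof(header);
    if (header.data_length > buf.size() - payload_pos) {
        return false;
    }
    crc32 crc;
    crc.process_bytes(buf.data() + payload_pos, header.data_length);
    crc.process_bytes(&header.data_length, sizeof(header.data_length));
    return crc.checksum() == header.crc32_code;
}
```

Because the placeholder carries the INVALID type until the flush patches it, a fragment that was never flushed (or was torn mid-write) fails verification, which is what lets bootstrap stop at the last valid checkpoint.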

diff --git a/src/fs/to_disk_buffer.hh b/src/fs/to_disk_buffer.hh
new file mode 100644
index 00000000..612f26d2
--- /dev/null
+++ b/src/fs/to_disk_buffer.hh
@@ -0,0 +1,138 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*

+ */
+
+#pragma once
+

+#include "fs/bitwise.hh"
+#include "fs/units.hh"
+#include "seastar/core/future.hh"
+#include "seastar/core/temporary_buffer.hh"
+#include "seastar/fs/block_device.hh"
+
+#include <cstring>

+
+namespace seastar::fs {
+

+// Represents a buffer that will be written to a block_device. Method init() should be called right after the
+// constructor in order to finish construction.
+class to_disk_buffer {
+protected:
+ temporary_buffer<uint8_t> _buff;
+ size_t _max_size = 0;
+ unit_size_t _alignment = 0;
+ disk_offset_t _cluster_beg_offset = 0; // disk offset that corresponds to _buff.begin()
+ range<size_t> _unflushed_data = {0, 0}; // range of unflushed bytes in _buff
+
+public:
+ to_disk_buffer() = default;
+
+ to_disk_buffer(const to_disk_buffer&) = delete;
+ to_disk_buffer& operator=(const to_disk_buffer&) = delete;
+ to_disk_buffer(to_disk_buffer&&) = default;
+ to_disk_buffer& operator=(to_disk_buffer&&) = default;
+
+ // Total number of bytes appended cannot exceed @p aligned_max_size.
+ // @p cluster_beg_offset is the disk offset of the beginning of the cluster.
+ virtual void init(size_t aligned_max_size, unit_size_t alignment, disk_offset_t cluster_beg_offset) {
+ assert(is_power_of_2(alignment));
+ assert(mod_by_power_of_2(aligned_max_size, alignment) == 0);
+ assert(mod_by_power_of_2(cluster_beg_offset, alignment) == 0);
+
+ _max_size = aligned_max_size;
+ _alignment = alignment;
+ _cluster_beg_offset = cluster_beg_offset;
+ _unflushed_data = {0, 0};
+ _buff = decltype(_buff)::aligned(_alignment, _max_size);
+ start_new_unflushed_data();
+ }
+
+ virtual ~to_disk_buffer() = default;
+
+ /**
+ * @brief Writes buffered (unflushed) data to disk and starts a new unflushed data if there is enough space
+ * IMPORTANT: using this buffer before call to flush_to_disk() completes is perfectly OK
+ * @details After each flush we align the offset at which the new unflushed data is continued. This is very
+ * important, as it ensures that consecutive flushes, as their underlying write operations to a block device,
+ * do not overlap. If the writes overlapped, it would be possible that they would be written in the reverse order
+ * corrupting the on-disk data.
+ *
+ * @param device output device
+ */
+ virtual future<> flush_to_disk(block_device device) {
+ prepare_unflushed_data_for_flush();
+ // Data layout overview:
+ // |.........................|00000000000000000000000|
+ // ^ _unflushed_data.beg ^ _unflushed_data.end ^ real_write.end
+ // (aligned) (maybe unaligned) (aligned)
+ // == real_write.beg == new _unflushed_data.beg
+ // |<------ padding ------>|
+ assert(mod_by_power_of_2(_unflushed_data.beg, _alignment) == 0);
+ range real_write = {
+ _unflushed_data.beg,
+ round_up_to_multiple_of_power_of_2(_unflushed_data.end, _alignment),
+ };
+ // Pad buffer with zeros till alignment
+ range padding = {_unflushed_data.end, real_write.end};
+ std::memset(_buff.get_write() + padding.beg, 0, padding.size());
+
+ // Make sure the buffer is usable before returning from this function
+ _unflushed_data = {real_write.end, real_write.end};
+ start_new_unflushed_data();
+
+ return device.write(_cluster_beg_offset + real_write.beg, _buff.get_write() + real_write.beg, real_write.size())
+ .then([real_write](size_t written_bytes) {
+ if (written_bytes != real_write.size()) {
+ return make_exception_future<>(std::runtime_error("Partial write"));
+ // TODO: maybe add some way to retry write, because once the buffer is corrupt nothing can be done now
+ }
+
+ return now();
+ });
+ }
+
+protected:
+ // May be called before the flushing of the previous fragment is finished
+ virtual void start_new_unflushed_data() noexcept {}
+
+ virtual void prepare_unflushed_data_for_flush() noexcept {}
+
+public:
+ virtual void append_bytes(const void* data, size_t len) noexcept {
+ assert(len <= bytes_left());
+ std::memcpy(_buff.get_write() + _unflushed_data.end, data, len);
+ _unflushed_data.end += len;
+ }
+
+ // Returns maximum number of bytes that may be written to buffer without calling reset()
+ virtual size_t bytes_left() const noexcept { return _max_size - _unflushed_data.end; }
+
+ virtual size_t bytes_left_after_flush_if_done_now() const noexcept {
+ return _max_size - round_up_to_multiple_of_power_of_2(_unflushed_data.end, _alignment);
+ }
+
+ // Returns disk offset of the place where the first byte of next appended bytes would be after flush
+ // TODO: maybe better name for that function? Or any other method to extract that data?
+ virtual disk_offset_t current_disk_offset() const noexcept {
+ return _cluster_beg_offset + _unflushed_data.end;

+ }
+};
+
+} // namespace seastar::fs
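The alignment argument in the flush_to_disk() comment (consecutive flushes must not overlap) can be checked with a small offset-only model. This is a sketch under assumptions: `flush_offsets`, `byte_range` and `round_up_pow2` are hypothetical names, the last one standing in for round_up_to_multiple_of_power_of_2() from fs/bitwise.hh, and no data or device I/O is modelled.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for round_up_to_multiple_of_power_of_2().
// `alignment` must be a power of 2.
constexpr std::size_t round_up_pow2(std::size_t value, std::size_t alignment) noexcept {
    return (value + alignment - 1) & ~(alignment - 1);
}

struct byte_range {
    std::size_t beg;
    std::size_t end;
};

// Tracks only the offsets of a to_disk_buffer-like object; no bytes are stored.
class flush_offsets {
    std::size_t _alignment;
    byte_range _unflushed{0, 0};

public:
    explicit flush_offsets(std::size_t alignment) : _alignment(alignment) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    }

    void append(std::size_t len) noexcept { _unflushed.end += len; }

    // Returns the byte range a flush would write, and restarts the unflushed
    // range at the aligned end so that the next write cannot overlap this one.
    byte_range flush() noexcept {
        byte_range real_write{_unflushed.beg, round_up_pow2(_unflushed.end, _alignment)};
        _unflushed = {real_write.end, real_write.end};
        return real_write;
    }
};
```

With an alignment of 8, appending 10 bytes and flushing yields the write range [0, 16); appending 3 more bytes and flushing again yields [16, 24). The two ranges share no bytes, so the device may complete them in either order.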

diff --git a/src/fs/unix_metadata.hh b/src/fs/unix_metadata.hh
new file mode 100644
index 00000000..6f634044
--- /dev/null
+++ b/src/fs/unix_metadata.hh
@@ -0,0 +1,40 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*

+ */
+
+#pragma once
+

+#include "seastar/core/file-types.hh"
+
+#include <cstdint>
+#include <sys/types.h>

+
+namespace seastar::fs {
+

+struct unix_metadata {
+ file_permissions perms;
+ uid_t uid;
+ gid_t gid;

+ uint64_t btime_ns;
+ uint64_t mtime_ns;
+ uint64_t ctime_ns;

+};
+
+} // namespace seastar::fs

diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc
new file mode 100644
index 00000000..6e29f2e5
--- /dev/null
+++ b/src/fs/metadata_log.cc
@@ -0,0 +1,222 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+

+#include "fs/cluster.hh"
+#include "fs/cluster_allocator.hh"
+#include "fs/inode.hh"
+#include "fs/inode_info.hh"
+#include "fs/metadata_disk_entries.hh"
+#include "fs/metadata_log.hh"
+#include "fs/metadata_log_bootstrap.hh"
+#include "fs/metadata_to_disk_buffer.hh"
+#include "fs/path.hh"
+#include "fs/units.hh"
+#include "fs/unix_metadata.hh"
+#include "seastar/core/aligned_buffer.hh"
+#include "seastar/core/do_with.hh"
+#include "seastar/core/file-types.hh"
+#include "seastar/core/future-util.hh"
+#include "seastar/core/future.hh"
+#include "seastar/core/shared_mutex.hh"
+#include "seastar/fs/overloaded.hh"
+
+#include <boost/crc.hpp>
+#include <boost/range/irange.hpp>
+#include <chrono>
+#include <cstddef>
+#include <cstdint>
+#include <cstring>
+#include <limits>
+#include <stdexcept>
+#include <string_view>
+#include <unordered_set>
+#include <variant>

+
+namespace seastar::fs {
+

+metadata_log::metadata_log(block_device device, uint32_t cluster_size, uint32_t alignment,
+ shared_ptr<metadata_to_disk_buffer> cluster_buff)
+: _device(std::move(device))
+, _cluster_size(cluster_size)
+, _alignment(alignment)
+, _curr_cluster_buff(std::move(cluster_buff))
+, _cluster_allocator({}, {})
+, _inode_allocator(1, 0) {
+ assert(is_power_of_2(alignment));
+ assert(cluster_size > 0 and cluster_size % alignment == 0);
+}
+
+metadata_log::metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment)
+: metadata_log(std::move(device), cluster_size, alignment,
+ make_shared<metadata_to_disk_buffer>()) {}
+
+future<> metadata_log::bootstrap(inode_t root_dir, cluster_id_t first_metadata_cluster_id, cluster_range available_clusters,
+ fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id) {
+ return metadata_log_bootstrap::bootstrap(*this, root_dir, first_metadata_cluster_id, available_clusters,
+ fs_shards_pool_size, fs_shard_id);
+}
+
+future<> metadata_log::shutdown() {
+ return flush_log().then([this] {
+ return _device.close();
+ });
+}
+
+void metadata_log::schedule_flush_of_curr_cluster() {
+ // Make writes concurrent (TODO: maybe serialized within *one* cluster would be faster?)
+ schedule_background_task(do_with(_curr_cluster_buff, &_device, [](auto& crr_clstr_bf, auto& device) {
+ return crr_clstr_bf->flush_to_disk(*device);
+ }));
+}
+
+future<> metadata_log::flush_curr_cluster() {
+ if (_curr_cluster_buff->bytes_left_after_flush_if_done_now() == 0) {
+ switch (schedule_flush_of_curr_cluster_and_change_it_to_new_one()) {
+ case flush_result::NO_SPACE:
+ return make_exception_future(no_more_space_exception());
+ case flush_result::DONE:
+ break;
+ }
+ } else {
+ schedule_flush_of_curr_cluster();
+ }
+
+ return _background_futures.get_future();
+}
+
+metadata_log::flush_result metadata_log::schedule_flush_of_curr_cluster_and_change_it_to_new_one() {
+ auto next_cluster = _cluster_allocator.alloc();
+ if (not next_cluster) {
+ // Here metadata log dies, we cannot even flush current cluster because from there we won't be able to recover
+ // TODO: ^ add protection from it and take it into account during compaction
+ return flush_result::NO_SPACE;
+ }
+
+ auto append_res = _curr_cluster_buff->append(ondisk_next_metadata_cluster {*next_cluster});
+ assert(append_res == metadata_to_disk_buffer::APPENDED);
+ schedule_flush_of_curr_cluster();
+
+ // Make next cluster the current cluster to allow writing next metadata entries before flushing finishes
+ _curr_cluster_buff = _curr_cluster_buff->virtual_constructor();
+ _curr_cluster_buff->init(_cluster_size, _alignment,
+ cluster_id_to_offset(*next_cluster, _cluster_size));
+ return flush_result::DONE;
+}
+
+std::variant<inode_t, metadata_log::path_lookup_error> metadata_log::do_path_lookup(const std::string& path) const noexcept {
+ if (path.empty() or path[0] != '/') {
+ return path_lookup_error::NOT_ABSOLUTE;
+ }
+
+ std::vector<inode_t> components_stack = {_root_dir};
+ size_t beg = 0;
+ while (beg < path.size()) {
+ range component_range = {beg, path.find('/', beg)};
+ bool check_if_dir = false;
+ if (component_range.end == path.npos) {
+ component_range.end = path.size();
+ beg = path.size();
+ } else {
+ check_if_dir = true;
+ beg = component_range.end + 1; // Jump over '/'
+ }
+
+ std::string_view component(path.data() + component_range.beg, component_range.size());
+ // Process the component
+ if (component == "") {
+ continue;
+ } else if (component == ".") {
+ assert(component_range.beg > 0 and path[component_range.beg - 1] == '/' and "Since path is absolute we do not have to check if the current component is a directory");
+ continue;
+ } else if (component == "..") {
+ if (components_stack.size() > 1) { // Root dir cannot be popped
+ components_stack.pop_back();
+ }
+ } else {
+ auto dir_it = _inodes.find(components_stack.back());
+ assert(dir_it != _inodes.end() and "inode comes from some previous lookup (or is a root directory) hence dir_it has to be valid");
+ assert(dir_it->second.is_directory() and "every previous component is a directory and it was checked when they were processed");
+ auto& curr_dir = dir_it->second.get_directory();
+
+ auto it = curr_dir.entries.find(component);
+ if (it == curr_dir.entries.end()) {
+ return path_lookup_error::NO_ENTRY;
+ }
+
+ inode_t entry_inode = it->second;
+ if (check_if_dir) {
+ auto entry_it = _inodes.find(entry_inode);
+ assert(entry_it != _inodes.end() and "dir entries have to exist");
+ if (not entry_it->second.is_directory()) {
+ return path_lookup_error::NOT_DIR;
+ }
+ }
+
+ components_stack.emplace_back(entry_inode);
+ }
+ }
+
+ return components_stack.back();
+}
+
+future<inode_t> metadata_log::path_lookup(const std::string& path) const {
+ auto lookup_res = do_path_lookup(path);
+ return std::visit(overloaded {
+ [](path_lookup_error error) {
+ switch (error) {
+ case path_lookup_error::NOT_ABSOLUTE:
+ return make_exception_future<inode_t>(path_is_not_absolute_exception());
+ case path_lookup_error::NO_ENTRY:
+ return make_exception_future<inode_t>(no_such_file_or_directory_exception());
+ case path_lookup_error::NOT_DIR:
+ return make_exception_future<inode_t>(path_component_not_directory_exception());
+ }
+ __builtin_unreachable();
+ },
+ [](inode_t inode) {
+ return make_ready_future<inode_t>(inode);
+ }
+ }, lookup_res);
+}
+
+file_offset_t metadata_log::file_size(inode_t inode) const {
+ auto it = _inodes.find(inode);
+ if (it == _inodes.end()) {
+ throw invalid_inode_exception();
+ }
+
+ return std::visit(overloaded {
+ [](const inode_info::file& file) {
+ return file.size();
+ },
+ [](const inode_info::directory&) -> file_offset_t {
+ throw invalid_inode_exception();
+ }
+ }, it->second.contents);
+}
+
+// TODO: think about how to make the filesystem recoverable from an ENOSPC situation: when flush() (or something else)
+// fails with ENOSPC, it should be possible to free some space (e.g. by truncating a file) via the top-level interface and
+// retry the flush() without an ENOSPC error. In particular, deleting all files after ENOSPC should succeed. This becomes
+// especially hard if we write metadata to the last cluster and there is not enough room to write these delete operations.
+// We have to guarantee that the filesystem is in a recoverable state then.

+
+} // namespace seastar::fs
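For reviewers who want to exercise the path-walking logic of do_path_lookup() in isolation, here is a minimal sketch of the same component-stack approach. It is not the patch's code: `dir_tree` and `lookup` are hypothetical, the three error kinds are collapsed into std::nullopt, and an inode without an entry in the tree is assumed to be a regular file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

using inode_t = std::uint64_t;

// Hypothetical in-memory tree: directory inode -> (entry name -> inode).
// An inode that has no key in this map is treated as a regular file.
using dir_tree = std::map<inode_t, std::map<std::string, inode_t, std::less<>>>;

std::optional<inode_t> lookup(const dir_tree& tree, inode_t root, std::string_view path) {
    if (path.empty() || path[0] != '/') {
        return std::nullopt; // path is not absolute
    }
    std::vector<inode_t> stack = {root};
    std::size_t beg = 0;
    while (beg < path.size()) {
        std::size_t end = path.find('/', beg);
        // A component followed by '/' has to name a directory.
        const bool must_be_dir = (end != std::string_view::npos);
        if (!must_be_dir) {
            end = path.size();
        }
        std::string_view component = path.substr(beg, end - beg);
        beg = must_be_dir ? end + 1 : end; // jump over '/'

        if (component.empty() || component == ".") {
            continue;
        }
        if (component == "..") {
            if (stack.size() > 1) { // the root directory cannot be popped
                stack.pop_back();
            }
            continue;
        }
        auto dir_it = tree.find(stack.back());
        if (dir_it == tree.end()) {
            return std::nullopt; // defensive: the parent is not a directory
        }
        auto entry_it = dir_it->second.find(component);
        if (entry_it == dir_it->second.end()) {
            return std::nullopt; // no such entry
        }
        if (must_be_dir && tree.count(entry_it->second) == 0) {
            return std::nullopt; // a non-final component is not a directory
        }
        stack.push_back(entry_it->second);
    }
    return stack.back();
}
```

Keeping a stack of visited directories rather than a single current inode is what makes ".." cheap: the parent is simply the previous stack element, so no back-pointer needs to be stored in the directory itself.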

diff --git a/src/fs/metadata_log_bootstrap.cc b/src/fs/metadata_log_bootstrap.cc
new file mode 100644
index 00000000..926d79fe
--- /dev/null
+++ b/src/fs/metadata_log_bootstrap.cc
@@ -0,0 +1,264 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+

+#include "fs/bitwise.hh"
+#include "fs/metadata_disk_entries.hh"
+#include "fs/metadata_log_bootstrap.hh"
+#include "seastar/util/log.hh"

+
+namespace seastar::fs {
+

+namespace {
+logger mlogger("fs_metadata_bootstrap");
+} // namespace
+
+bool data_reader::read(void* destination, size_t size) {
+ if (_pos + size > _size) {
+ return false;
+ }
+
+ std::memcpy(destination, _data + _pos, size);
+ _pos += size;
+ return true;
+}
+
+bool data_reader::read_string(std::string& str, size_t size) {
+ str.resize(size);
+ return read(str.data(), size);
+}
+
+std::optional<temporary_buffer<uint8_t>> data_reader::read_tmp_buff(size_t size) {
+ if (_pos + size > _size) {

+ return std::nullopt;
+ }
+

+ _pos += size;
+ return temporary_buffer<uint8_t>(_data + _pos - size, size);
+}
+
+bool data_reader::process_crc_without_reading(boost::crc_32_type& crc, size_t size) {
+ if (_pos + size > _size) {
+ return false;
+ }
+
+ crc.process_bytes(_data + _pos, size);
+ return true;
+}
+
+std::optional<data_reader> data_reader::extract(size_t size) {
+ if (_pos + size > _size) {

+ return std::nullopt;
+ }
+

+ _pos += size;
+ return data_reader(_data + _pos - size, size);
+}
+
+metadata_log_bootstrap::metadata_log_bootstrap(metadata_log& metadata_log, cluster_range available_clusters)
+: _metadata_log(metadata_log)
+, _available_clusters(available_clusters)
+, _curr_cluster_data(decltype(_curr_cluster_data)::aligned(metadata_log._alignment, metadata_log._cluster_size))
+{}
+
+future<> metadata_log_bootstrap::bootstrap(cluster_id_t first_metadata_cluster_id, fs_shard_id_t fs_shards_pool_size,
+ fs_shard_id_t fs_shard_id) {
+ _next_cluster = first_metadata_cluster_id;
+ mlogger.debug(">>>> Started bootstrapping <<<<");
+ return do_with((cluster_id_t)first_metadata_cluster_id, [this](cluster_id_t& last_cluster) {
+ return do_until([this] { return not _next_cluster.has_value(); }, [this, &last_cluster] {
+ cluster_id_t curr_cluster = *_next_cluster;
+ _next_cluster = std::nullopt;
+ bool inserted = _taken_clusters.emplace(curr_cluster).second;
+ assert(inserted); // TODO: check it in next_cluster record
+ last_cluster = curr_cluster;
+ return bootstrap_cluster(curr_cluster);
+ }).then([this, &last_cluster] {
+ mlogger.debug("Data bootstrapping is done");
+ // Initialize _curr_cluster_buff
+ _metadata_log._curr_cluster_buff = _metadata_log._curr_cluster_buff->virtual_constructor();
+ mlogger.debug("Initializing _curr_cluster_buff: cluster {}, pos {}", last_cluster, _curr_cluster.last_checkpointed_pos());
+ _metadata_log._curr_cluster_buff->init_from_bootstrapped_cluster(_metadata_log._cluster_size,
+ _metadata_log._alignment, cluster_id_to_offset(last_cluster, _metadata_log._cluster_size),
+ _curr_cluster.last_checkpointed_pos());
+ });
+ }).then([this, fs_shards_pool_size, fs_shard_id] {
+ // Initialize _cluster_allocator
+ mlogger.debug("Initializing cluster allocator");
+ std::deque<cluster_id_t> free_clusters;
+ for (auto cid : boost::irange(_available_clusters.beg, _available_clusters.end)) {
+ if (_taken_clusters.count(cid) == 0) {
+ free_clusters.emplace_back(cid);
+ }
+ }
+ if (free_clusters.empty()) {
+ return make_exception_future(no_more_space_exception());
+ }
+ free_clusters.pop_front();
+
+ mlogger.debug("free clusters: {}", free_clusters.size());
+ _metadata_log._cluster_allocator = cluster_allocator(std::move(_taken_clusters), std::move(free_clusters));
+
+ // Reset _inode_allocator
+ std::optional<inode_t> max_inode_no;
+ if (not _metadata_log._inodes.empty()) {
+ max_inode_no = _metadata_log._inodes.rbegin()->first;
+ }
+ _metadata_log._inode_allocator = shard_inode_allocator(fs_shards_pool_size, fs_shard_id, max_inode_no);
+
+ // TODO: what about orphaned inodes: maybe they are remnants of unlinked files and we need to delete them,
+ // or maybe not?
+ return now();
+ });
+}
+
+future<> metadata_log_bootstrap::bootstrap_cluster(cluster_id_t curr_cluster) {
+ disk_offset_t curr_cluster_disk_offset = cluster_id_to_offset(curr_cluster, _metadata_log._cluster_size);
+ mlogger.debug("Bootstrapping from cluster {}...", curr_cluster);
+ return _metadata_log._device.read(curr_cluster_disk_offset, _curr_cluster_data.get_write(),
+ _metadata_log._cluster_size).then([this, curr_cluster](size_t bytes_read) {
+ if (bytes_read != _metadata_log._cluster_size) {
+ return make_exception_future(std::runtime_error("Failed to read whole cluster of the metadata log"));
+ }
+
+ mlogger.debug("Read cluster {}", curr_cluster);
+ _curr_cluster = data_reader(_curr_cluster_data.get(), _metadata_log._cluster_size);
+ return bootstrap_read_cluster();
+ });
+}
+
+future<> metadata_log_bootstrap::bootstrap_read_cluster() {
+ // Process cluster: the data layout format is:
+ // | checkpoint1 | data1... | checkpoint2 | data2... | ... |
+ return do_with(false, [this](bool& whole_log_ended) {
+ return do_until([this, &whole_log_ended] { return whole_log_ended or _next_cluster.has_value(); },
+ [this, &whole_log_ended] {
+ _curr_cluster.align_curr_pos(_metadata_log._alignment);
+ _curr_cluster.checkpoint_curr_pos();
+
+ if (not read_and_check_checkpoint()) {
+ mlogger.debug("Checkpoint invalid");
+ whole_log_ended = true;

+ return now();
+ }
+

+ mlogger.debug("Checkpoint correct");
+ return bootstrap_checkpointed_data();
+ }).then([] {
+ mlogger.debug("Cluster ended");
+ });
+ });
+}
+
+bool metadata_log_bootstrap::read_and_check_checkpoint() {
+ mlogger.debug("Processing checkpoint at {}", _curr_cluster.curr_pos());
+ ondisk_type entry_type;
+ ondisk_checkpoint checkpoint;
+ if (not _curr_cluster.read_entry(entry_type)) {
+ mlogger.debug("Cannot read entry type");
+ return false;
+ }
+ if (entry_type != CHECKPOINT) {
+ mlogger.debug("Entry type (= {}) is not CHECKPOINT (= {})", entry_type, CHECKPOINT);
+ return false;
+ }
+ if (not _curr_cluster.read_entry(checkpoint)) {
+ mlogger.debug("Cannot read checkpoint entry");
+ return false;
+ }
+
+ boost::crc_32_type crc;
+ if (not _curr_cluster.process_crc_without_reading(crc, checkpoint.checkpointed_data_length)) {
+ mlogger.debug("Invalid checkpoint's data length: {}", (unit_size_t)checkpoint.checkpointed_data_length);
+ return false;
+ }
+ crc.process_bytes(&checkpoint.checkpointed_data_length, sizeof(checkpoint.checkpointed_data_length));
+ if (crc.checksum() != checkpoint.crc32_code) {
+ mlogger.debug("CRC code does not match: computed = {}, read = {}", crc.checksum(), (uint32_t)checkpoint.crc32_code);
+ return false;
+ }
+
+ auto opt = _curr_cluster.extract(checkpoint.checkpointed_data_length);
+ assert(opt.has_value());
+ _curr_checkpoint = *opt;
+ return true;
+}
+
+future<> metadata_log_bootstrap::bootstrap_checkpointed_data() {
+ return do_with(ondisk_type {}, [this](ondisk_type& entry_type) {
+ return do_until([this, &entry_type] { return not _curr_checkpoint.read_entry(entry_type); },
+ [this, &entry_type] {
+ switch (entry_type) {
+ case INVALID:
+ case CHECKPOINT: // CHECKPOINT cannot appear as part of checkpointed data
+ return invalid_entry_exception();
+ case NEXT_METADATA_CLUSTER:
+ return bootstrap_next_metadata_cluster();
+ }
+
+ // unknown type => metadata log corruption
+ return invalid_entry_exception();
+ }).then([this] {
+ if (_curr_checkpoint.bytes_left() > 0) {
+ return invalid_entry_exception(); // Corrupted checkpointed data
+ }
+ return now();
+ });
+ });
+}
+
+future<> metadata_log_bootstrap::bootstrap_next_metadata_cluster() {
+ ondisk_next_metadata_cluster entry;
+ if (not _curr_checkpoint.read_entry(entry)) {

+ return invalid_entry_exception();
+ }
+

+ if (_next_cluster.has_value()) {
+ return invalid_entry_exception(); // Only one NEXT_METADATA_CLUSTER may appear in one cluster
+ }
+
+ _next_cluster = (cluster_id_t)entry.cluster_id;

+ return now();
+}
+

+bool metadata_log_bootstrap::inode_exists(inode_t inode) {
+ return _metadata_log._inodes.count(inode) != 0;
+}
+
+future<> metadata_log_bootstrap::bootstrap(metadata_log& metadata_log, inode_t root_dir, cluster_id_t first_metadata_cluster_id,
+ cluster_range available_clusters, fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id) {
+ // Clear the metadata log
+ metadata_log._inodes.clear();
+ metadata_log._background_futures = now();
+ metadata_log._root_dir = root_dir;
+ metadata_log._inodes.emplace(root_dir, inode_info {
+ 0,
+ 0,
+ {}, // TODO: change it to something meaningful
+ inode_info::directory {}
+ });
+
+ return do_with(metadata_log_bootstrap(metadata_log, available_clusters),
+ [first_metadata_cluster_id, fs_shards_pool_size, fs_shard_id](metadata_log_bootstrap& bootstrap) {
+ return bootstrap.bootstrap(first_metadata_cluster_id, fs_shards_pool_size, fs_shard_id);

+ });
+}
+
+} // namespace seastar::fs
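The bounds checks in data_reader are all of the form `_pos + size > _size`. A minimal standalone sketch of the same reader follows, with the check written as a subtraction so that a very large `size` (for instance a corrupted length field read from disk) cannot wrap around. `bounded_reader` is a hypothetical name, and the class is trimmed to read(), read_entry() and extract(); it is an illustration, not a proposed replacement.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>

// Hypothetical, trimmed-down counterpart of data_reader.
class bounded_reader {
    const std::uint8_t* _data = nullptr;
    std::size_t _size = 0;
    std::size_t _pos = 0;

    // _pos <= _size always holds, so the subtraction cannot underflow, and
    // unlike `_pos + size > _size` the comparison cannot overflow either.
    bool fits(std::size_t size) const noexcept { return size <= _size - _pos; }

public:
    bounded_reader() = default;
    bounded_reader(const std::uint8_t* data, std::size_t size) : _data(data), _size(size) {}

    std::size_t bytes_left() const noexcept { return _size - _pos; }

    // Returns whether the reading was successful; on failure nothing is consumed.
    bool read(void* destination, std::size_t size) noexcept {
        if (!fits(size)) {
            return false;
        }
        if (size != 0) {
            std::memcpy(destination, _data + _pos, size);
        }
        _pos += size;
        return true;
    }

    template<class T>
    bool read_entry(T& entry) noexcept {
        return read(&entry, sizeof(entry));
    }

    // Splits off a sub-reader over the next `size` bytes and skips past them.
    std::optional<bounded_reader> extract(std::size_t size) noexcept {
        if (!fits(size)) {
            return std::nullopt;
        }
        _pos += size;
        return bounded_reader(_data + _pos - size, size);
    }
};
```

The sub-reader returned by extract() is what confines entry parsing to one checkpoint: a malformed entry can fail to parse, but it cannot read past the checkpointed range into the next fragment.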

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 8a59eca6..19666a8a 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -658,6 +658,7 @@ if (Seastar_EXPERIMENTAL_FS)
PRIVATE

# SeastarFS source files
include/seastar/fs/block_device.hh

+ include/seastar/fs/exceptions.hh
include/seastar/fs/file.hh
include/seastar/fs/overloaded.hh
include/seastar/fs/temporary_file.hh
@@ -670,9 +671,18 @@ if (Seastar_EXPERIMENTAL_FS)
src/fs/crc.hh
src/fs/file.cc
src/fs/inode.hh
+ src/fs/inode_info.hh
+ src/fs/metadata_disk_entries.hh
+ src/fs/metadata_log.cc
+ src/fs/metadata_log.hh
+ src/fs/metadata_log_bootstrap.cc
+ src/fs/metadata_log_bootstrap.hh
+ src/fs/metadata_to_disk_buffer.hh
src/fs/path.hh
src/fs/range.hh
+ src/fs/to_disk_buffer.hh
src/fs/units.hh
+ src/fs/unix_metadata.hh

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:33 AM

to seastar-dev@googlegroups.com, Krzysztof Małysa, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---

src/fs/metadata_disk_entries.hh | 11 ++
src/fs/metadata_log.hh | 8 +
src/fs/metadata_log_bootstrap.hh | 2 +
src/fs/metadata_log_operations/create_file.hh | 174 ++++++++++++++++++
src/fs/metadata_to_disk_buffer.hh | 13 ++
src/fs/metadata_log.cc | 24 +++
src/fs/metadata_log_bootstrap.cc | 30 +++
CMakeLists.txt | 1 +
8 files changed, 263 insertions(+)
create mode 100644 src/fs/metadata_log_operations/create_file.hh

diff --git a/src/fs/metadata_disk_entries.hh b/src/fs/metadata_disk_entries.hh
index 437c2c2b..9c44b8cc 100644
--- a/src/fs/metadata_disk_entries.hh
+++ b/src/fs/metadata_disk_entries.hh
@@ -73,6 +73,7 @@ enum ondisk_type : uint8_t {
CHECKPOINT,
NEXT_METADATA_CLUSTER,
CREATE_INODE,
+ CREATE_INODE_AS_DIR_ENTRY,
};

struct ondisk_checkpoint {
@@ -102,11 +103,21 @@ struct ondisk_create_inode {
ondisk_unix_metadata metadata;
} __attribute__((packed));

+struct ondisk_create_inode_as_dir_entry_header {
+ ondisk_create_inode entry_inode;
+ inode_t dir_inode;
+ uint16_t entry_name_length;
+ // After header comes entry name
+} __attribute__((packed));
+
template<typename T>
constexpr size_t ondisk_entry_size(const T& entry) noexcept {
static_assert(std::is_same_v<T, ondisk_next_metadata_cluster> or
std::is_same_v<T, ondisk_create_inode>, "ondisk entry size not defined for given type");
return sizeof(ondisk_type) + sizeof(entry);
}
+constexpr size_t ondisk_entry_size(const ondisk_create_inode_as_dir_entry_header& entry) noexcept {
+ return sizeof(ondisk_type) + sizeof(entry) + entry.entry_name_length;
+}

} // namespace seastar::fs
diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
index 6f069c13..cc11b865 100644
--- a/src/fs/metadata_log.hh
+++ b/src/fs/metadata_log.hh
@@ -157,6 +157,7 @@ class metadata_log {
friend class metadata_log_bootstrap;

friend class create_and_open_unlinked_file_operation;
+ friend class create_file_operation;

public:

metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment,

@@ -179,6 +180,7 @@ class metadata_log {

}

inode_info& memory_only_create_inode(inode_t inode, bool is_directory, unix_metadata metadata);

+ void memory_only_add_dir_entry(inode_info::directory& dir, inode_t entry_inode, std::string entry_name);

template<class Func>
void schedule_background_task(Func&& task) {

@@ -290,8 +292,14 @@ class metadata_log {

// Returns size of the file or throws exception iff @p inode is invalid

file_offset_t file_size(inode_t inode) const;

+ future<> create_file(std::string path, file_permissions perms);
+
+ future<inode_t> create_and_open_file(std::string path, file_permissions perms);
+
future<inode_t> create_and_open_unlinked_file(file_permissions perms);

+ future<> create_directory(std::string path, file_permissions perms);
+
// All disk-related errors will be exposed here

future<> flush_log() {
return flush_curr_cluster();
diff --git a/src/fs/metadata_log_bootstrap.hh b/src/fs/metadata_log_bootstrap.hh
index 4a1fa7e9..d44c2f96 100644
--- a/src/fs/metadata_log_bootstrap.hh
+++ b/src/fs/metadata_log_bootstrap.hh
@@ -117,6 +117,8 @@ class metadata_log_bootstrap {

future<> bootstrap_create_inode();

+ future<> bootstrap_create_inode_as_dir_entry();
+
public:

static future<> bootstrap(metadata_log& metadata_log, inode_t root_dir, cluster_id_t first_metadata_cluster_id,

cluster_range available_clusters, fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id);

diff --git a/src/fs/metadata_log_operations/create_file.hh b/src/fs/metadata_log_operations/create_file.hh
new file mode 100644
index 00000000..3ba83226
--- /dev/null
+++ b/src/fs/metadata_log_operations/create_file.hh
@@ -0,0 +1,174 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/metadata_log.hh"
+#include "fs/path.hh"
+#include "seastar/core/future.hh"
+
+namespace seastar::fs {
+
+enum class create_semantics {
+ CREATE_FILE,
+ CREATE_AND_OPEN_FILE,
+ CREATE_DIR,
+};
+
+class create_file_operation {
+ metadata_log& _metadata_log;
+ create_semantics _create_semantics;
+ std::string _entry_name;
+ file_permissions _perms;
+ inode_t _dir_inode;
+ inode_info::directory* _dir_info;
+
+ create_file_operation(metadata_log& metadata_log) : _metadata_log(metadata_log) {}
+
+ future<inode_t> create_file(std::string path, file_permissions perms, create_semantics create_semantics) {
+ _create_semantics = create_semantics;
+ switch (create_semantics) {
+ case create_semantics::CREATE_FILE:
+ case create_semantics::CREATE_AND_OPEN_FILE:
+ break;
+ case create_semantics::CREATE_DIR:
+ while (not path.empty() and path.back() == '/') {
+ path.pop_back();
+ }
+ }
+
+ _entry_name = extract_last_component(path);
+ if (_entry_name.empty()) {
+ return make_exception_future<inode_t>(invalid_path_exception());
+ }
+ assert(path.empty() or path.back() == '/'); // Hence fast-checking for "is directory" is done in path_lookup
+
+ _perms = perms;
+ return _metadata_log.path_lookup(path).then([this](inode_t dir_inode) {
+ _dir_inode = dir_inode;
+ // Fail-fast checks before locking (as locking may be expensive)
+ auto dir_it = _metadata_log._inodes.find(_dir_inode);
+ if (dir_it == _metadata_log._inodes.end()) {
+ return make_exception_future<inode_t>(operation_became_invalid_exception());
+ }
+ assert(dir_it->second.is_directory() and "Directory cannot become file or there is a BUG in path_lookup");
+ _dir_info = &dir_it->second.get_directory();
+
+ if (_dir_info->entries.count(_entry_name) != 0) {
+ return make_exception_future<inode_t>(file_already_exists_exception());
+ }
+
+ return _metadata_log._locks.with_locks(metadata_log::locks::shared {dir_inode},
+ metadata_log::locks::unique {dir_inode, _entry_name}, [this] {
+ return create_file_in_directory();
+ });
+ });
+ }
+
+ future<inode_t> create_file_in_directory() {
+ if (not _metadata_log.inode_exists(_dir_inode)) {
+ return make_exception_future<inode_t>(operation_became_invalid_exception());
+ }
+
+ if (_dir_info->entries.count(_entry_name) != 0) {
+ return make_exception_future<inode_t>(file_already_exists_exception());
+ }
+
+ ondisk_create_inode_as_dir_entry_header ondisk_entry;
+ decltype(ondisk_entry.entry_name_length) entry_name_length;
+ if (_entry_name.size() > std::numeric_limits<decltype(entry_name_length)>::max()) {
+ // TODO: add an assert that cluster_size is not too small; otherwise the operation would allocate
+ // all clusters and only then fail with ENOSPACE
+ return make_exception_future<inode_t>(filename_too_long_exception());
+ }
+ entry_name_length = _entry_name.size();
+
+ using namespace std::chrono;
+ uint64_t curr_time_ns = duration_cast<nanoseconds>(system_clock::now().time_since_epoch()).count();
+ unix_metadata unx_mtdt = {
+ _perms,
+ 0, // TODO: Eventually, we'll want a user to be able to pass his credentials when bootstrapping the
+ 0, // file system -- that will allow us to authorize users on startup (e.g. via LDAP or whatnot).
+ curr_time_ns,
+ curr_time_ns,
+ curr_time_ns
+ };
+
+ bool creating_dir = [this] {
+ switch (_create_semantics) {
+ case create_semantics::CREATE_FILE:
+ case create_semantics::CREATE_AND_OPEN_FILE:
+ return false;
+ case create_semantics::CREATE_DIR:
+ return true;
+ }
+ __builtin_unreachable();
+ }();
+
+ inode_t new_inode = _metadata_log._inode_allocator.alloc();
+
+ ondisk_entry = {
+ {
+ new_inode,
+ creating_dir,
+ metadata_to_ondisk_metadata(unx_mtdt)
+ },
+ _dir_inode,
+ entry_name_length,
+ };
+
+
+ switch (_metadata_log.append_ondisk_entry(ondisk_entry, _entry_name.data())) {
+ case metadata_log::append_result::TOO_BIG:
+ return make_exception_future<inode_t>(cluster_size_too_small_to_perform_operation_exception());
+ case metadata_log::append_result::NO_SPACE:
+ return make_exception_future<inode_t>(no_more_space_exception());
+ case metadata_log::append_result::APPENDED:
+ inode_info& new_inode_info = _metadata_log.memory_only_create_inode(new_inode,
+ creating_dir, unx_mtdt);
+ _metadata_log.memory_only_add_dir_entry(*_dir_info, new_inode, std::move(_entry_name));
+
+ switch (_create_semantics) {
+ case create_semantics::CREATE_FILE:
+ case create_semantics::CREATE_DIR:
+ break;
+ case create_semantics::CREATE_AND_OPEN_FILE:
+ // We don't have to lock, as there was no context switch since the allocation of the inode number
+ ++new_inode_info.opened_files_count;
+ break;
+ }
+
+ return make_ready_future<inode_t>(new_inode);
+ }
+ __builtin_unreachable();
+ }
+
+public:
+ static future<inode_t> perform(metadata_log& metadata_log, std::string path, file_permissions perms,
+ create_semantics create_semantics) {
+ return do_with(create_file_operation(metadata_log),
+ [path = std::move(path), perms = std::move(perms), create_semantics](auto& obj) {
+ return obj.create_file(std::move(path), std::move(perms), create_semantics);
+ });
+ }
+};
+
+} // namespace seastar::fs
diff --git a/src/fs/metadata_to_disk_buffer.hh b/src/fs/metadata_to_disk_buffer.hh
index 593ad46a..87b2bd8e 100644
--- a/src/fs/metadata_to_disk_buffer.hh
+++ b/src/fs/metadata_to_disk_buffer.hh
@@ -157,6 +157,19 @@ class metadata_to_disk_buffer : protected to_disk_buffer {
return append_simple(CREATE_INODE, create_inode);
}

+ [[nodiscard]] virtual append_result append(const ondisk_create_inode_as_dir_entry_header& create_inode_as_dir_entry,
+ const void* entry_name) noexcept {
+ ondisk_type type = CREATE_INODE_AS_DIR_ENTRY;
+ if (not fits_for_append(ondisk_entry_size(create_inode_as_dir_entry))) {
+ return TOO_BIG;
+ }
+
+ append_bytes(&type, sizeof(type));
+ append_bytes(&create_inode_as_dir_entry, sizeof(create_inode_as_dir_entry));
+ append_bytes(entry_name, create_inode_as_dir_entry.entry_name_length);
+ return APPENDED;
+ }
+
using to_disk_buffer::flush_to_disk;
};

diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc
index be523fc7..d35d3710 100644
--- a/src/fs/metadata_log.cc
+++ b/src/fs/metadata_log.cc
@@ -27,6 +27,7 @@
#include "fs/metadata_log.hh"
#include "fs/metadata_log_bootstrap.hh"
#include "fs/metadata_log_operations/create_and_open_unlinked_file.hh"
+#include "fs/metadata_log_operations/create_file.hh"

#include "fs/metadata_to_disk_buffer.hh"
#include "fs/path.hh"
#include "fs/units.hh"

@@ -97,6 +98,17 @@ inode_info& metadata_log::memory_only_create_inode(inode_t inode, bool is_direct
}).first->second;
}

+void metadata_log::memory_only_add_dir_entry(inode_info::directory& dir, inode_t entry_inode, std::string entry_name) {
+ auto it = _inodes.find(entry_inode);
+ assert(it != _inodes.end());
+ // Directory may only be linked once (to avoid creating cycles)
+ assert(not it->second.is_directory() or not it->second.is_linked());
+
+ bool inserted = dir.entries.emplace(std::move(entry_name), entry_inode).second;
+ assert(inserted);
+ ++it->second.directories_containing_file;
+}
+
void metadata_log::schedule_flush_of_curr_cluster() {

// Make writes concurrent (TODO: maybe serialized within *one* cluster would be faster?)

schedule_background_task(do_with(_curr_cluster_buff, &_device, [](auto& crr_clstr_bf, auto& device) {

@@ -230,10 +242,22 @@ file_offset_t metadata_log::file_size(inode_t inode) const {
}, it->second.contents);
}

+future<> metadata_log::create_file(std::string path, file_permissions perms) {
+ return create_file_operation::perform(*this, std::move(path), std::move(perms), create_semantics::CREATE_FILE).discard_result();
+}
+
+future<inode_t> metadata_log::create_and_open_file(std::string path, file_permissions perms) {
+ return create_file_operation::perform(*this, std::move(path), std::move(perms), create_semantics::CREATE_AND_OPEN_FILE);
+}
+
future<inode_t> metadata_log::create_and_open_unlinked_file(file_permissions perms) {

return create_and_open_unlinked_file_operation::perform(*this, std::move(perms));
}

+future<> metadata_log::create_directory(std::string path, file_permissions perms) {
+ return create_file_operation::perform(*this, std::move(path), std::move(perms), create_semantics::CREATE_DIR).discard_result();
+}
+

// TODO: think about how to make filesystem recoverable from ENOSPACE situation: flush() (or something else) throws ENOSPACE,

// then it should be possible to compact some data (e.g. by truncating a file) via top-level interface and retrying the flush()

// without a ENOSPACE error. In particular if we delete all files after ENOSPACE it should be successful. It becomes especially

diff --git a/src/fs/metadata_log_bootstrap.cc b/src/fs/metadata_log_bootstrap.cc
index 702e0e34..01b567f0 100644
--- a/src/fs/metadata_log_bootstrap.cc
+++ b/src/fs/metadata_log_bootstrap.cc
@@ -213,6 +213,8 @@ future<> metadata_log_bootstrap::bootstrap_checkpointed_data() {
return bootstrap_next_metadata_cluster();
case CREATE_INODE:
return bootstrap_create_inode();
+ case CREATE_INODE_AS_DIR_ENTRY:
+ return bootstrap_create_inode_as_dir_entry();

}

// unknown type => metadata log corruption

@@ -255,6 +257,34 @@ future<> metadata_log_bootstrap::bootstrap_create_inode() {
return now();
}

+future<> metadata_log_bootstrap::bootstrap_create_inode_as_dir_entry() {
+ ondisk_create_inode_as_dir_entry_header entry;
+ if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.dir_inode) or
+ inode_exists(entry.entry_inode.inode)) {
+ return invalid_entry_exception();
+ }
+
+ std::string dir_entry_name;
+ if (not _curr_checkpoint.read_string(dir_entry_name, entry.entry_name_length)) {
+ return invalid_entry_exception();
+ }
+
+ if (not _metadata_log._inodes[entry.dir_inode].is_directory()) {
+ return invalid_entry_exception();
+ }
+ auto& dir = _metadata_log._inodes[entry.dir_inode].get_directory();
+
+ if (dir.entries.count(dir_entry_name) != 0) {
+ return invalid_entry_exception();
+ }
+
+ _metadata_log.memory_only_create_inode(entry.entry_inode.inode, entry.entry_inode.is_directory,
+ ondisk_metadata_to_metadata(entry.entry_inode.metadata));
+ _metadata_log.memory_only_add_dir_entry(dir, entry.entry_inode.inode, std::move(dir_entry_name));
+ // TODO: Maybe mtime_ns for modifying directory?
+ return now();
+}
+

future<> metadata_log_bootstrap::bootstrap(metadata_log& metadata_log, inode_t root_dir, cluster_id_t first_metadata_cluster_id,

cluster_range available_clusters, fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id) {

// Clear the metadata log

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 3304a02b..46cdf803 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -678,6 +678,7 @@ if (Seastar_EXPERIMENTAL_FS)
src/fs/metadata_log_bootstrap.cc
src/fs/metadata_log_bootstrap.hh
src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
+ src/fs/metadata_log_operations/create_file.hh

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:34 AM

to seastar-dev@googlegroups.com, Krzysztof Małysa, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

Some operations need to schedule the deletion of an inode in the background.
One of these is closing an unlinked file when nobody else holds it open.

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
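Note on the deletion rule: an inode may only be deleted once it is neither
linked from any directory nor held open, and the check is repeated when the
scheduled attempt actually runs. A minimal stand-alone sketch of that rule
follows. toy_inode, toy_log and close_file are hypothetical names, inode_t
being a 64-bit integer is an assumption, and the deletion here runs
synchronously instead of being scheduled as a background task.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>

using inode_t = uint64_t; // assumption: 64-bit inode numbers

// Hypothetical, simplified stand-in for inode_info
struct toy_inode {
    bool is_directory = false;
    uint32_t directories_containing_file = 0; // number of directory entries pointing here
    uint32_t opened_files_count = 0;
    std::map<std::string, inode_t> entries; // used only by directories

    bool is_linked() const { return directories_containing_file > 0; }
    bool is_open() const { return opened_files_count > 0; }
};

struct toy_log {
    std::unordered_map<inode_t, toy_inode> inodes;

    // Returns true if the inode was deleted, false if the attempt became invalid
    bool attempt_to_delete_inode(inode_t inode) {
        auto it = inodes.find(inode);
        if (it == inodes.end() or it->second.is_linked() or it->second.is_open()) {
            return false;
        }
        assert(not it->second.is_directory or it->second.entries.empty());
        inodes.erase(it);
        return true;
    }

    // Closing the last handle of an unlinked file triggers a delete attempt
    void close_file(inode_t inode) {
        auto& info = inodes.at(inode);
        assert(info.is_open());
        if (--info.opened_files_count == 0 and not info.is_linked()) {
            attempt_to_delete_inode(inode);
        }
    }
};
```

Re-checking the conditions inside the attempt matters in the real code because
the inode may be linked or reopened between scheduling and execution.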

src/fs/metadata_disk_entries.hh | 8 ++++++-
src/fs/metadata_log.hh | 3 +++
src/fs/metadata_log_bootstrap.hh | 2 ++
src/fs/metadata_to_disk_buffer.hh | 4 ++++
src/fs/metadata_log.cc | 38 +++++++++++++++++++++++++++++++
src/fs/metadata_log_bootstrap.cc | 21 +++++++++++++++++
6 files changed, 75 insertions(+), 1 deletion(-)

diff --git a/src/fs/metadata_disk_entries.hh b/src/fs/metadata_disk_entries.hh
index 9c44b8cc..310b1864 100644
--- a/src/fs/metadata_disk_entries.hh
+++ b/src/fs/metadata_disk_entries.hh
@@ -73,6 +73,7 @@ enum ondisk_type : uint8_t {
CHECKPOINT,
NEXT_METADATA_CLUSTER,
CREATE_INODE,

+ DELETE_INODE,
CREATE_INODE_AS_DIR_ENTRY,
};

@@ -103,6 +104,10 @@ struct ondisk_create_inode {
ondisk_unix_metadata metadata;
} __attribute__((packed));

+struct ondisk_delete_inode {
+ inode_t inode;
+} __attribute__((packed));
+
struct ondisk_create_inode_as_dir_entry_header {
ondisk_create_inode entry_inode;
inode_t dir_inode;
@@ -113,7 +118,8 @@ struct ondisk_create_inode_as_dir_entry_header {

template<typename T>
constexpr size_t ondisk_entry_size(const T& entry) noexcept {
static_assert(std::is_same_v<T, ondisk_next_metadata_cluster> or

- std::is_same_v<T, ondisk_create_inode>, "ondisk entry size not defined for given type");
+ std::is_same_v<T, ondisk_create_inode> or
+ std::is_same_v<T, ondisk_delete_inode>, "ondisk entry size not defined for given type");
return sizeof(ondisk_type) + sizeof(entry);
}

constexpr size_t ondisk_entry_size(const ondisk_create_inode_as_dir_entry_header& entry) noexcept {

diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
index cc11b865..be5e843b 100644
--- a/src/fs/metadata_log.hh
+++ b/src/fs/metadata_log.hh
@@ -180,6 +180,7 @@ class metadata_log {

}

inode_info& memory_only_create_inode(inode_t inode, bool is_directory, unix_metadata metadata);

+ void memory_only_delete_inode(inode_t inode);

void memory_only_add_dir_entry(inode_info::directory& dir, inode_t entry_inode, std::string entry_name);

template<class Func>

@@ -232,6 +233,8 @@ class metadata_log {
__builtin_unreachable();
}

+ void schedule_attempt_to_delete_inode(inode_t inode);
+
enum class path_lookup_error {

NOT_ABSOLUTE, // a path is not absolute

NO_ENTRY, // no such file or directory

diff --git a/src/fs/metadata_log_bootstrap.hh b/src/fs/metadata_log_bootstrap.hh
index d44c2f96..b28bce7f 100644

--- a/src/fs/metadata_log_bootstrap.hh
+++ b/src/fs/metadata_log_bootstrap.hh
@@ -117,6 +117,8 @@ class metadata_log_bootstrap {

future<> bootstrap_create_inode();

+ future<> bootstrap_delete_inode();
+
future<> bootstrap_create_inode_as_dir_entry();

public:
diff --git a/src/fs/metadata_to_disk_buffer.hh b/src/fs/metadata_to_disk_buffer.hh
index 87b2bd8e..9eb1c538 100644
--- a/src/fs/metadata_to_disk_buffer.hh
+++ b/src/fs/metadata_to_disk_buffer.hh
@@ -157,6 +157,10 @@ class metadata_to_disk_buffer : protected to_disk_buffer {
return append_simple(CREATE_INODE, create_inode);
}

+ [[nodiscard]] virtual append_result append(const ondisk_delete_inode& delete_inode) noexcept {
+ return append_simple(DELETE_INODE, delete_inode);
+ }
+
[[nodiscard]] virtual append_result append(const ondisk_create_inode_as_dir_entry_header& create_inode_as_dir_entry,

const void* entry_name) noexcept {
ondisk_type type = CREATE_INODE_AS_DIR_ENTRY;
diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc
index d35d3710..7f42f353 100644
--- a/src/fs/metadata_log.cc
+++ b/src/fs/metadata_log.cc
@@ -98,6 +98,24 @@ inode_info& metadata_log::memory_only_create_inode(inode_t inode, bool is_direct
}).first->second;
}

+void metadata_log::memory_only_delete_inode(inode_t inode) {
+ auto it = _inodes.find(inode);
+ assert(it != _inodes.end());
+ assert(not it->second.is_open());
+ assert(not it->second.is_linked());
+
+ std::visit(overloaded {
+ [](const inode_info::directory& dir) {
+ assert(dir.entries.empty());
+ },
+ [](const inode_info::file&) {
+ // TODO: for compaction: update used inode_data_vec
+ }
+ }, it->second.contents);
+
+ _inodes.erase(it);
+}
+

void metadata_log::memory_only_add_dir_entry(inode_info::directory& dir, inode_t entry_inode, std::string entry_name) {

auto it = _inodes.find(entry_inode);
assert(it != _inodes.end());
@@ -150,6 +168,26 @@ metadata_log::flush_result metadata_log::schedule_flush_of_curr_cluster_and_chan
return flush_result::DONE;
}

+void metadata_log::schedule_attempt_to_delete_inode(inode_t inode) {
+ return schedule_background_task([this, inode] {
+ auto it = _inodes.find(inode);
+ if (it == _inodes.end() or it->second.is_linked() or it->second.is_open()) {
+ return now(); // Scheduled delete became invalid
+ }
+
+ switch (append_ondisk_entry(ondisk_delete_inode {inode})) {
+ case append_result::TOO_BIG:
+ assert(false and "ondisk entry cannot be too big");
+ case append_result::NO_SPACE:
+ return make_exception_future(no_more_space_exception());
+ case append_result::APPENDED:
+ memory_only_delete_inode(inode);
+ return now();
+ }
+ __builtin_unreachable();
+ });
+}
+

std::variant<inode_t, metadata_log::path_lookup_error> metadata_log::do_path_lookup(const std::string& path) const noexcept {

if (path.empty() or path[0] != '/') {

return path_lookup_error::NOT_ABSOLUTE;
diff --git a/src/fs/metadata_log_bootstrap.cc b/src/fs/metadata_log_bootstrap.cc
index 01b567f0..3058328a 100644

--- a/src/fs/metadata_log_bootstrap.cc
+++ b/src/fs/metadata_log_bootstrap.cc
@@ -213,6 +213,8 @@ future<> metadata_log_bootstrap::bootstrap_checkpointed_data() {
return bootstrap_next_metadata_cluster();
case CREATE_INODE:
return bootstrap_create_inode();

+ case DELETE_INODE:
+ return bootstrap_delete_inode();
case CREATE_INODE_AS_DIR_ENTRY:
return bootstrap_create_inode_as_dir_entry();
}
@@ -257,6 +259,25 @@ future<> metadata_log_bootstrap::bootstrap_create_inode() {
return now();
}

+future<> metadata_log_bootstrap::bootstrap_delete_inode() {
+ ondisk_delete_inode entry;
+ if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.inode)) {
+ return invalid_entry_exception();
+ }
+
+ inode_info& inode_info = _metadata_log._inodes.at(entry.inode);
+ if (inode_info.directories_containing_file > 0) {
+ return invalid_entry_exception(); // Only unlinked inodes may be deleted
+ }
+
+ if (inode_info.is_directory() and not inode_info.get_directory().entries.empty()) {
+ return invalid_entry_exception(); // Only empty directories may be deleted
+ }
+
+ _metadata_log.memory_only_delete_inode(entry.inode);
+ return now();
+}
+

future<> metadata_log_bootstrap::bootstrap_create_inode_as_dir_entry() {
ondisk_create_inode_as_dir_entry_header entry;

if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.dir_inode) or

--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:37 AM

to seastar-dev@googlegroups.com, Krzysztof Małysa, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

Linking allows the same file to be visible via different paths and allows
giving a path to an unlinked file.

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---

src/fs/metadata_disk_entries.hh | 11 ++
src/fs/metadata_log.hh | 7 ++
src/fs/metadata_log_bootstrap.hh | 2 +
src/fs/metadata_log_operations/link_file.hh | 112 ++++++++++++++++++++
src/fs/metadata_to_disk_buffer.hh | 12 +++
src/fs/metadata_log.cc | 11 ++
src/fs/metadata_log_bootstrap.cc | 34 ++++++
CMakeLists.txt | 1 +
8 files changed, 190 insertions(+)
create mode 100644 src/fs/metadata_log_operations/link_file.hh

diff --git a/src/fs/metadata_disk_entries.hh b/src/fs/metadata_disk_entries.hh
index 310b1864..b81c25f5 100644
--- a/src/fs/metadata_disk_entries.hh
+++ b/src/fs/metadata_disk_entries.hh
@@ -74,6 +74,7 @@ enum ondisk_type : uint8_t {
NEXT_METADATA_CLUSTER,
CREATE_INODE,
DELETE_INODE,
+ ADD_DIR_ENTRY,
CREATE_INODE_AS_DIR_ENTRY,
};

@@ -108,6 +109,13 @@ struct ondisk_delete_inode {
inode_t inode;
} __attribute__((packed));

+struct ondisk_add_dir_entry_header {
+ inode_t dir_inode;
+ inode_t entry_inode;
+ uint16_t entry_name_length;
+ // After header comes entry name
+} __attribute__((packed));
+
struct ondisk_create_inode_as_dir_entry_header {
ondisk_create_inode entry_inode;
inode_t dir_inode;

@@ -122,6 +130,9 @@ constexpr size_t ondisk_entry_size(const T& entry) noexcept {

std::is_same_v<T, ondisk_delete_inode>, "ondisk entry size not defined for given type");
return sizeof(ondisk_type) + sizeof(entry);
}

+constexpr size_t ondisk_entry_size(const ondisk_add_dir_entry_header& entry) noexcept {
+ return sizeof(ondisk_type) + sizeof(entry) + entry.entry_name_length;
+}

constexpr size_t ondisk_entry_size(const ondisk_create_inode_as_dir_entry_header& entry) noexcept {

return sizeof(ondisk_type) + sizeof(entry) + entry.entry_name_length;
}

diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
index be5e843b..f5373458 100644
--- a/src/fs/metadata_log.hh
+++ b/src/fs/metadata_log.hh
@@ -158,6 +158,7 @@ class metadata_log {

friend class create_and_open_unlinked_file_operation;
friend class create_file_operation;
+ friend class link_file_operation;

public:
metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment,

@@ -303,6 +304,12 @@ class metadata_log {

future<> create_directory(std::string path, file_permissions perms);

+ // Creates name (@p path) for a file (@p inode)
+ future<> link_file(inode_t inode, std::string path);
+
+ // Creates name (@p destination) for a file (not directory) @p source
+ future<> link_file(std::string source, std::string destination);

+
// All disk-related errors will be exposed here
future<> flush_log() {
return flush_curr_cluster();

diff --git a/src/fs/metadata_log_bootstrap.hh b/src/fs/metadata_log_bootstrap.hh
index b28bce7f..3b38b232 100644
--- a/src/fs/metadata_log_bootstrap.hh
+++ b/src/fs/metadata_log_bootstrap.hh
@@ -119,6 +119,8 @@ class metadata_log_bootstrap {

future<> bootstrap_delete_inode();

+ future<> bootstrap_add_dir_entry();
+
future<> bootstrap_create_inode_as_dir_entry();

public:
diff --git a/src/fs/metadata_log_operations/link_file.hh b/src/fs/metadata_log_operations/link_file.hh
new file mode 100644
index 00000000..207fe327
--- /dev/null
+++ b/src/fs/metadata_log_operations/link_file.hh
@@ -0,0 +1,112 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/metadata_disk_entries.hh"
+#include "fs/metadata_log.hh"
+#include "fs/path.hh"
+
+namespace seastar::fs {
+
+class link_file_operation {
+ metadata_log& _metadata_log;
+ inode_t _src_inode;
+ std::string _entry_name;
+ inode_t _dir_inode;
+ inode_info::directory* _dir_info;
+
+ link_file_operation(metadata_log& metadata_log) : _metadata_log(metadata_log) {}
+
+ future<> link_file(inode_t inode, std::string path) {
+ _src_inode = inode;
+ _entry_name = extract_last_component(path);
+ if (_entry_name.empty()) {
+ return make_exception_future(is_directory_exception());
+ }
+ assert(path.empty() or path.back() == '/'); // Hence fast-checking for "is directory" is done in path_lookup
+
+ return _metadata_log.path_lookup(path).then([this](inode_t dir_inode) {
+ _dir_inode = dir_inode;
+ // Fail-fast checks before locking (as locking may be expensive)
+ auto dir_it = _metadata_log._inodes.find(_dir_inode);
+ if (dir_it == _metadata_log._inodes.end()) {
+ return make_exception_future(operation_became_invalid_exception());
+ }
+ assert(dir_it->second.is_directory() and "Directory cannot become file or there is a BUG in path_lookup");
+ _dir_info = &dir_it->second.get_directory();
+
+ if (_dir_info->entries.count(_entry_name) != 0) {
+ return make_exception_future(file_already_exists_exception());
+ }
+
+ return _metadata_log._locks.with_locks(metadata_log::locks::shared {dir_inode},
+ metadata_log::locks::unique {dir_inode, _entry_name}, [this] {
+ return link_file_in_directory();
+ });
+ });
+ }
+
+ future<> link_file_in_directory() {
+ if (not _metadata_log.inode_exists(_dir_inode)) {
+ return make_exception_future(operation_became_invalid_exception());
+ }
+
+ if (_dir_info->entries.count(_entry_name) != 0) {
+ return make_exception_future(file_already_exists_exception());
+ }
+
+ ondisk_add_dir_entry_header ondisk_entry;
+ decltype(ondisk_entry.entry_name_length) entry_name_length;
+ if (_entry_name.size() > std::numeric_limits<decltype(entry_name_length)>::max()) {
+ // TODO: add an assert that cluster_size is not too small; otherwise the operation would allocate
+ // all clusters and only then fail with ENOSPACE
+ return make_exception_future(filename_too_long_exception());
+ }
+ entry_name_length = _entry_name.size();
+
+ ondisk_entry = {
+ _dir_inode,
+ _src_inode,
+ entry_name_length,
+ };
+
+ switch (_metadata_log.append_ondisk_entry(ondisk_entry, _entry_name.data())) {
+ case metadata_log::append_result::TOO_BIG:
+ return make_exception_future(cluster_size_too_small_to_perform_operation_exception());
+ case metadata_log::append_result::NO_SPACE:
+ return make_exception_future(no_more_space_exception());
+ case metadata_log::append_result::APPENDED:
+ _metadata_log.memory_only_add_dir_entry(*_dir_info, _src_inode, std::move(_entry_name));
+ return now();
+ }
+ __builtin_unreachable();
+ }
+
+public:
+ static future<> perform(metadata_log& metadata_log, inode_t inode, std::string path) {
+ return do_with(link_file_operation(metadata_log), [inode, path = std::move(path)](auto& obj) {
+ return obj.link_file(inode, std::move(path));
+ });
+ }
+};
+
+} // namespace seastar::fs

diff --git a/src/fs/metadata_to_disk_buffer.hh b/src/fs/metadata_to_disk_buffer.hh
index 9eb1c538..38180224 100644
--- a/src/fs/metadata_to_disk_buffer.hh
+++ b/src/fs/metadata_to_disk_buffer.hh
@@ -161,6 +161,18 @@ class metadata_to_disk_buffer : protected to_disk_buffer {
return append_simple(DELETE_INODE, delete_inode);
}

+ [[nodiscard]] virtual append_result append(const ondisk_add_dir_entry_header& add_dir_entry, const void* entry_name) noexcept {
+ ondisk_type type = ADD_DIR_ENTRY;
+ if (not fits_for_append(ondisk_entry_size(add_dir_entry))) {
+ return TOO_BIG;
+ }
+
+ append_bytes(&type, sizeof(type));
+ append_bytes(&add_dir_entry, sizeof(add_dir_entry));
+ append_bytes(entry_name, add_dir_entry.entry_name_length);
+ return APPENDED;
+ }
+
[[nodiscard]] virtual append_result append(const ondisk_create_inode_as_dir_entry_header& create_inode_as_dir_entry,
const void* entry_name) noexcept {
ondisk_type type = CREATE_INODE_AS_DIR_ENTRY;
diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc
index 7f42f353..a8b17c2b 100644
--- a/src/fs/metadata_log.cc
+++ b/src/fs/metadata_log.cc
@@ -28,6 +28,7 @@
#include "fs/metadata_log_bootstrap.hh"
#include "fs/metadata_log_operations/create_and_open_unlinked_file.hh"
#include "fs/metadata_log_operations/create_file.hh"
+#include "fs/metadata_log_operations/link_file.hh"

#include "fs/metadata_to_disk_buffer.hh"
#include "fs/path.hh"
#include "fs/units.hh"

@@ -296,6 +297,16 @@ future<> metadata_log::create_directory(std::string path, file_permissions perms

return create_file_operation::perform(*this, std::move(path), std::move(perms), create_semantics::CREATE_DIR).discard_result();
}

+future<> metadata_log::link_file(inode_t inode, std::string path) {
+ return link_file_operation::perform(*this, inode, std::move(path));
+}
+
+future<> metadata_log::link_file(std::string source, std::string destination) {
+ return path_lookup(std::move(source)).then([this, destination = std::move(destination)](inode_t inode) {
+ return link_file(inode, std::move(destination));
+ });
+}
+
// TODO: think about how to make filesystem recoverable from ENOSPACE situation: flush() (or something else) throws ENOSPACE,
// then it should be possible to compact some data (e.g. by truncating a file) via top-level interface and retrying the flush()
// without a ENOSPACE error. In particular if we delete all files after ENOSPACE it should be successful. It becomes especially

diff --git a/src/fs/metadata_log_bootstrap.cc b/src/fs/metadata_log_bootstrap.cc
index 3058328a..64396d11 100644
--- a/src/fs/metadata_log_bootstrap.cc
+++ b/src/fs/metadata_log_bootstrap.cc
@@ -215,6 +215,8 @@ future<> metadata_log_bootstrap::bootstrap_checkpointed_data() {
return bootstrap_create_inode();
case DELETE_INODE:
return bootstrap_delete_inode();
+ case ADD_DIR_ENTRY:
+ return bootstrap_add_dir_entry();
case CREATE_INODE_AS_DIR_ENTRY:
return bootstrap_create_inode_as_dir_entry();
}
@@ -278,6 +280,38 @@ future<> metadata_log_bootstrap::bootstrap_delete_inode() {
return now();
}

+future<> metadata_log_bootstrap::bootstrap_add_dir_entry() {
+ ondisk_add_dir_entry_header entry;
+ if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.dir_inode) or
+ not inode_exists(entry.entry_inode)) {
+ return invalid_entry_exception();
+ }
+
+ std::string dir_entry_name;
+ if (not _curr_checkpoint.read_string(dir_entry_name, entry.entry_name_length)) {
+ return invalid_entry_exception();
+ }
+
+ // Only files may be linked, so as not to create cycles (directories are created and linked using
+ // CREATE_INODE_AS_DIR_ENTRY)
+ if (not _metadata_log._inodes[entry.entry_inode].is_file()) {
+ return invalid_entry_exception();
+ }
+
+ if (not _metadata_log._inodes[entry.dir_inode].is_directory()) {
+ return invalid_entry_exception();
+ }
+ auto& dir = _metadata_log._inodes[entry.dir_inode].get_directory();
+
+ if (dir.entries.count(dir_entry_name) != 0) {
+ return invalid_entry_exception();
+ }
+
+ _metadata_log.memory_only_add_dir_entry(dir, entry.entry_inode, std::move(dir_entry_name));
+ // TODO: Maybe mtime_ns for modifying directory?
+ return now();
+}
+
future<> metadata_log_bootstrap::bootstrap_create_inode_as_dir_entry() {
ondisk_create_inode_as_dir_entry_header entry;
if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.dir_inode) or

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 46cdf803..6259742e 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -679,6 +679,7 @@ if (Seastar_EXPERIMENTAL_FS)
src/fs/metadata_log_bootstrap.hh
src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
src/fs/metadata_log_operations/create_file.hh
+ src/fs/metadata_log_operations/link_file.hh
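
The ADD_DIR_ENTRY record that this patch appends and later replays during bootstrap is laid out as a one-byte type tag, a fixed header, and then the entry name. The following is a minimal, Seastar-free sketch of that layout and of the bounds checks a reader needs. It is not the patch's code: the 64-bit width of `inode_t`, the tag value, and the helper names `put`, `get`, `append_entry` and `read_entry` are all assumptions made for illustration, and fields are copied in native byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins for the patch's types; the widths are assumptions.
using inode_t = std::uint64_t;
constexpr std::uint8_t ADD_DIR_ENTRY_TAG = 1; // arbitrary tag value for this sketch

struct add_dir_entry {
    inode_t dir_inode;
    inode_t entry_inode;
    std::string name;
};

// Appends the raw bytes of a trivially copyable value.
template <typename T>
void put(std::vector<std::uint8_t>& out, const T& v) {
    const auto* p = reinterpret_cast<const std::uint8_t*>(&v);
    out.insert(out.end(), p, p + sizeof(T));
}

// Reads a value at pos, failing instead of reading past the end of the buffer.
template <typename T>
bool get(const std::vector<std::uint8_t>& in, std::size_t& pos, T& v) {
    if (pos > in.size() || in.size() - pos < sizeof(T)) {
        return false;
    }
    std::memcpy(&v, in.data() + pos, sizeof(T));
    pos += sizeof(T);
    return true;
}

// Layout: tag, dir_inode, entry_inode, 16-bit name length, name bytes.
// Returns false when the name does not fit the 16-bit length field.
bool append_entry(std::vector<std::uint8_t>& out, const add_dir_entry& e) {
    if (e.name.size() > UINT16_MAX) {
        return false;
    }
    put(out, ADD_DIR_ENTRY_TAG);
    put(out, e.dir_inode);
    put(out, e.entry_inode);
    put(out, static_cast<std::uint16_t>(e.name.size()));
    out.insert(out.end(), e.name.begin(), e.name.end());
    return true;
}

// Parses one record starting at pos; pos advances only on success.
std::optional<add_dir_entry> read_entry(const std::vector<std::uint8_t>& in, std::size_t& pos) {
    std::size_t p = pos;
    std::uint8_t tag = 0;
    std::uint16_t len = 0;
    add_dir_entry e{};
    if (!get(in, p, tag) || tag != ADD_DIR_ENTRY_TAG) {
        return std::nullopt;
    }
    if (!get(in, p, e.dir_inode) || !get(in, p, e.entry_inode) || !get(in, p, len)) {
        return std::nullopt;
    }
    if (in.size() - p < len) {
        return std::nullopt; // truncated name: treat the record as invalid
    }
    e.name.assign(reinterpret_cast<const char*>(in.data() + p), len);
    pos = p + len;
    return e;
}
```

The record size here equals tag size plus header size plus name length, which mirrors the shape of `ondisk_entry_size()` for the name-carrying headers in the series. The semantic checks done by `bootstrap_add_dir_entry()` (both inodes exist, the target is a file, the name is not already taken) would come after this purely structural parse.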

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:37 AM

to seastar-dev@googlegroups.com, Michał Niciejewski, sarna@scylladb.com, ankezy@gmail.com, wmitros@protonmail.com

From: Michał Niciejewski <qup...@gmail.com>

Adds open_file(), which marks a file as open by incrementing its
opened-files counter.

Signed-off-by: Michał Niciejewski <qup...@gmail.com>
---

src/fs/metadata_log.hh | 3 +++
src/fs/metadata_log.cc | 22 ++++++++++++++++++++++
2 files changed, 25 insertions(+)

diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
index 02af052e..7bb2c6bc 100644
--- a/src/fs/metadata_log.hh
+++ b/src/fs/metadata_log.hh
@@ -319,6 +319,9 @@ class metadata_log {
// Removes empty directory or unlinks file
future<> remove(std::string path);

+ // TODO: what about permissions, uid, gid etc.
+ future<inode_t> open_file(std::string path);
+
// All disk-related errors will be exposed here
future<> flush_log() {
return flush_curr_cluster();

diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc
index 752682e4..00ce88d2 100644
--- a/src/fs/metadata_log.cc
+++ b/src/fs/metadata_log.cc
@@ -332,6 +332,28 @@ future<> metadata_log::remove(std::string path) {
return unlink_or_remove_file_operation::perform(*this, std::move(path), remove_semantics::FILE_OR_DIR);
}

+future<inode_t> metadata_log::open_file(std::string path) {
+ return path_lookup(path).then([this](inode_t inode) {
+ auto inode_it = _inodes.find(inode);
+ if (inode_it == _inodes.end()) {
+ return make_exception_future<inode_t>(operation_became_invalid_exception());
+ }
+ inode_info* inode_info = &inode_it->second;
+ if (inode_info->is_directory()) {
+ return make_exception_future<inode_t>(is_directory_exception());
+ }
+
+ // TODO: can be replaced by sth like _inode_info.during_delete
+ return _locks.with_lock(metadata_log::locks::shared {inode}, [this, inode_info = std::move(inode_info), inode] {
+ if (not inode_exists(inode)) {
+ return make_exception_future<inode_t>(operation_became_invalid_exception());
+ }
+ ++inode_info->opened_files_count;
+ return make_ready_future<inode_t>(inode);
+ });
+ });
+}
+
// TODO: think about how to make filesystem recoverable from ENOSPACE situation: flush() (or something else) throws ENOSPACE,
// then it should be possible to compact some data (e.g. by truncating a file) via top-level interface and retrying the flush()
// without a ENOSPACE error. In particular if we delete all files after ENOSPACE it should be successful. It becomes especially

--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:37 AM

to seastar-dev@googlegroups.com, Krzysztof Małysa, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---

src/fs/metadata_disk_entries.hh | 21 ++
src/fs/metadata_log.hh | 9 +
src/fs/metadata_log_bootstrap.hh | 4 +
.../unlink_or_remove_file.hh | 196 ++++++++++++++++++
src/fs/metadata_to_disk_buffer.hh | 24 +++
src/fs/metadata_log.cc | 25 +++
src/fs/metadata_log_bootstrap.cc | 70 +++++++
CMakeLists.txt | 1 +
8 files changed, 350 insertions(+)
create mode 100644 src/fs/metadata_log_operations/unlink_or_remove_file.hh

diff --git a/src/fs/metadata_disk_entries.hh b/src/fs/metadata_disk_entries.hh
index b81c25f5..2f363a9b 100644
--- a/src/fs/metadata_disk_entries.hh
+++ b/src/fs/metadata_disk_entries.hh
@@ -76,6 +76,8 @@ enum ondisk_type : uint8_t {
DELETE_INODE,
ADD_DIR_ENTRY,
CREATE_INODE_AS_DIR_ENTRY,
+ DELETE_DIR_ENTRY,
+ DELETE_INODE_AND_DIR_ENTRY,
};

struct ondisk_checkpoint {
@@ -123,6 +125,19 @@ struct ondisk_create_inode_as_dir_entry_header {

// After header comes entry name

} __attribute__((packed));

+struct ondisk_delete_dir_entry_header {
+ inode_t dir_inode;
+ uint16_t entry_name_length;
+ // After header comes entry name
+} __attribute__((packed));
+
+struct ondisk_delete_inode_and_dir_entry_header {
+ inode_t inode_to_delete;
+ inode_t dir_inode;
+ uint16_t entry_name_length;
+ // After header comes entry name
+} __attribute__((packed));
+
template<typename T>

constexpr size_t ondisk_entry_size(const T& entry) noexcept {

static_assert(std::is_same_v<T, ondisk_next_metadata_cluster> or

@@ -136,5 +151,11 @@ constexpr size_t ondisk_entry_size(const ondisk_add_dir_entry_header& entry) noe

constexpr size_t ondisk_entry_size(const ondisk_create_inode_as_dir_entry_header& entry) noexcept {
return sizeof(ondisk_type) + sizeof(entry) + entry.entry_name_length;
}

+constexpr size_t ondisk_entry_size(const ondisk_delete_dir_entry_header& entry) noexcept {
+ return sizeof(ondisk_type) + sizeof(entry) + entry.entry_name_length;
+}
+constexpr size_t ondisk_entry_size(const ondisk_delete_inode_and_dir_entry_header& entry) noexcept {
+ return sizeof(ondisk_type) + sizeof(entry) + entry.entry_name_length;
+}

} // namespace seastar::fs
diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
index f5373458..02af052e 100644
--- a/src/fs/metadata_log.hh
+++ b/src/fs/metadata_log.hh
@@ -159,6 +159,7 @@ class metadata_log {

friend class create_and_open_unlinked_file_operation;
friend class create_file_operation;

friend class link_file_operation;
+ friend class unlink_or_remove_file_operation;

public:
metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment,

@@ -183,6 +184,7 @@ class metadata_log {

inode_info& memory_only_create_inode(inode_t inode, bool is_directory, unix_metadata metadata);

void memory_only_delete_inode(inode_t inode);
void memory_only_add_dir_entry(inode_info::directory& dir, inode_t entry_inode, std::string entry_name);

+ void memory_only_delete_dir_entry(inode_info::directory& dir, std::string entry_name);

template<class Func>
void schedule_background_task(Func&& task) {
@@ -310,6 +312,13 @@ class metadata_log {

// Creates name (@p destination) for a file (not directory) @p source

future<> link_file(std::string source, std::string destination);

+ future<> unlink_file(std::string path);
+
+ future<> remove_directory(std::string path);
+
+ // Removes empty directory or unlinks file
+ future<> remove(std::string path);
+

// All disk-related errors will be exposed here
future<> flush_log() {
return flush_curr_cluster();

diff --git a/src/fs/metadata_log_bootstrap.hh b/src/fs/metadata_log_bootstrap.hh
index 3b38b232..16b429ab 100644
--- a/src/fs/metadata_log_bootstrap.hh
+++ b/src/fs/metadata_log_bootstrap.hh
@@ -123,6 +123,10 @@ class metadata_log_bootstrap {

future<> bootstrap_create_inode_as_dir_entry();

+ future<> bootstrap_delete_dir_entry();
+
+ future<> bootstrap_delete_inode_and_dir_entry();

+
public:
static future<> bootstrap(metadata_log& metadata_log, inode_t root_dir, cluster_id_t first_metadata_cluster_id,
cluster_range available_clusters, fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id);

diff --git a/src/fs/metadata_log_operations/unlink_or_remove_file.hh b/src/fs/metadata_log_operations/unlink_or_remove_file.hh
new file mode 100644
index 00000000..a5f29cd9
--- /dev/null
+++ b/src/fs/metadata_log_operations/unlink_or_remove_file.hh
@@ -0,0 +1,196 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+

+#include "fs/inode.hh"
+#include "fs/inode_info.hh"

+#include "fs/metadata_disk_entries.hh"
+#include "fs/metadata_log.hh"
+#include "fs/path.hh"

+#include "seastar/core/future-util.hh"
+#include "seastar/core/future.hh"

+
+namespace seastar::fs {
+

+enum class remove_semantics {
+ FILE_ONLY,
+ DIR_ONLY,
+ FILE_OR_DIR,
+};
+
+class unlink_or_remove_file_operation {
+ metadata_log& _metadata_log;
+ remove_semantics _remove_semantics;

+ std::string _entry_name;
+ inode_t _dir_inode;
+ inode_info::directory* _dir_info;

+ inode_t _entry_inode;
+ inode_info* _entry_inode_info;
+
+ unlink_or_remove_file_operation(metadata_log& metadata_log) : _metadata_log(metadata_log) {}
+
+ future<> unlink_or_remove(std::string path, remove_semantics remove_semantics) {
+ _remove_semantics = remove_semantics;
+ while (not path.empty() and path.back() == '/') {
+ path.pop_back();
+ }
+
+ _entry_name = extract_last_component(path);
+ if (_entry_name.empty()) {
+ return make_exception_future(invalid_path_exception()); // We cannot remove "/"
+ }
+ assert(path.empty() or path.back() == '/'); // Hence fast-check for "is directory" is done in path_lookup
+
+ return _metadata_log.path_lookup(path).then([this](inode_t dir_inode) {
+ _dir_inode = dir_inode;
+ // Fail-fast checks before locking (as locking may be expensive)

+ auto dir_it = _metadata_log._inodes.find(dir_inode);

+ if (dir_it == _metadata_log._inodes.end()) {
+ return make_exception_future(operation_became_invalid_exception());
+ }
+ assert(dir_it->second.is_directory() and "Directory cannot become file or there is a BUG in path_lookup");
+ _dir_info = &dir_it->second.get_directory();
+

+ auto entry_it = _dir_info->entries.find(_entry_name);
+ if (entry_it == _dir_info->entries.end()) {
+ return make_exception_future(no_such_file_or_directory_exception());
+ }
+ _entry_inode = entry_it->second;
+
+ _entry_inode_info = &_metadata_log._inodes.at(_entry_inode);
+ if (_entry_inode_info->is_directory()) {
+ switch (_remove_semantics) {
+ case remove_semantics::FILE_ONLY:
+ return make_exception_future(is_directory_exception());
+ case remove_semantics::DIR_ONLY:
+ case remove_semantics::FILE_OR_DIR:
+ break;
+ }
+
+ if (not _entry_inode_info->get_directory().entries.empty()) {
+ return make_exception_future(directory_not_empty_exception());
+ }
+ } else {
+ assert(_entry_inode_info->is_file());
+ switch (_remove_semantics) {
+ case remove_semantics::DIR_ONLY:
+ return make_exception_future(is_directory_exception());
+ case remove_semantics::FILE_ONLY:
+ case remove_semantics::FILE_OR_DIR:
+ break;
+ }
+ }
+
+ // Getting a lock on directory entry is enough to ensure it won't disappear because deleting directory
+ // requires it to be empty
+ if (_entry_inode_info->is_directory()) {
+ return _metadata_log._locks.with_locks(metadata_log::locks::unique {dir_inode, _entry_name}, metadata_log::locks::unique {_entry_inode}, [this] {
+ return unlink_or_remove_file_in_directory();
+ });
+ } else {
+ return _metadata_log._locks.with_locks(metadata_log::locks::unique {dir_inode, _entry_name}, metadata_log::locks::shared {_entry_inode}, [this] {
+ return unlink_or_remove_file_in_directory();
+ });
+ }
+ });
+ }
+
+ future<> unlink_or_remove_file_in_directory() {

+ if (not _metadata_log.inode_exists(_dir_inode)) {
+ return make_exception_future(operation_became_invalid_exception());
+ }
+

+ auto entry_it = _dir_info->entries.find(_entry_name);
+ if (entry_it == _dir_info->entries.end() or entry_it->second != _entry_inode) {

+ return make_exception_future(operation_became_invalid_exception());
+ }
+

+ if (_entry_inode_info->is_directory()) {
+ inode_info::directory& dir = _entry_inode_info->get_directory();
+ if (not dir.entries.empty()) {
+ return make_exception_future(directory_not_empty_exception());
+ }
+
+ assert(_entry_inode_info->directories_containing_file == 1);
+
+ // Ready to delete directory
+ ondisk_delete_inode_and_dir_entry_header ondisk_entry;
+ using entry_name_length_t = decltype(ondisk_entry.entry_name_length);
+ assert(_entry_name.size() <= std::numeric_limits<entry_name_length_t>::max());
+
+ ondisk_entry = {
+ _entry_inode,
+ _dir_inode,
+ static_cast<entry_name_length_t>(_entry_name.size())

+ };
+
+ switch (_metadata_log.append_ondisk_entry(ondisk_entry, _entry_name.data())) {
+ case metadata_log::append_result::TOO_BIG:
+ return make_exception_future(cluster_size_too_small_to_perform_operation_exception());
+ case metadata_log::append_result::NO_SPACE:
+ return make_exception_future(no_more_space_exception());
+ case metadata_log::append_result::APPENDED:

+ _metadata_log.memory_only_delete_dir_entry(*_dir_info, _entry_name);
+ _metadata_log.memory_only_delete_inode(_entry_inode);

+ return now();
+ }
+ __builtin_unreachable();
+ }
+

+ assert(_entry_inode_info->is_file());
+
+ // Ready to unlink file
+ ondisk_delete_dir_entry_header ondisk_entry;
+ using entry_name_length_t = decltype(ondisk_entry.entry_name_length);
+ assert(_entry_name.size() <= std::numeric_limits<entry_name_length_t>::max());

+ ondisk_entry = {
+ _dir_inode,

+ static_cast<entry_name_length_t>(_entry_name.size())

+ };
+
+ switch (_metadata_log.append_ondisk_entry(ondisk_entry, _entry_name.data())) {
+ case metadata_log::append_result::TOO_BIG:
+ return make_exception_future(cluster_size_too_small_to_perform_operation_exception());
+ case metadata_log::append_result::NO_SPACE:
+ return make_exception_future(no_more_space_exception());
+ case metadata_log::append_result::APPENDED:

+ _metadata_log.memory_only_delete_dir_entry(*_dir_info, _entry_name);
+ break;
+ }
+
+ if (not _entry_inode_info->is_linked() and not _entry_inode_info->is_open()) {
+ // File became unlinked and not open, so we need to delete it
+ _metadata_log.schedule_attempt_to_delete_inode(_entry_inode);
+ }
+

+ return now();
+ }
+

+public:
+ static future<> perform(metadata_log& metadata_log, std::string path, remove_semantics remove_semantics) {
+ return do_with(unlink_or_remove_file_operation(metadata_log), [path = std::move(path), remove_semantics](auto& obj) {
+ return obj.unlink_or_remove(std::move(path), remove_semantics);
+ });
+ }
+};
+
+} // namespace seastar::fs
diff --git a/src/fs/metadata_to_disk_buffer.hh b/src/fs/metadata_to_disk_buffer.hh

index 38180224..979a03c2 100644
--- a/src/fs/metadata_to_disk_buffer.hh
+++ b/src/fs/metadata_to_disk_buffer.hh
@@ -186,6 +186,30 @@ class metadata_to_disk_buffer : protected to_disk_buffer {
return APPENDED;
}

+ [[nodiscard]] virtual append_result append(const ondisk_delete_dir_entry_header& delete_dir_entry, const void* entry_name) noexcept {
+ ondisk_type type = DELETE_DIR_ENTRY;
+ if (not fits_for_append(ondisk_entry_size(delete_dir_entry))) {
+ return TOO_BIG;
+ }
+
+ append_bytes(&type, sizeof(type));
+ append_bytes(&delete_dir_entry, sizeof(delete_dir_entry));
+ append_bytes(entry_name, delete_dir_entry.entry_name_length);
+ return APPENDED;
+ }
+
+ [[nodiscard]] virtual append_result append(const ondisk_delete_inode_and_dir_entry_header& delete_inode_and_dir_entry, const void* entry_name) noexcept {
+ ondisk_type type = DELETE_INODE_AND_DIR_ENTRY;
+ if (not fits_for_append(ondisk_entry_size(delete_inode_and_dir_entry))) {
+ return TOO_BIG;
+ }
+
+ append_bytes(&type, sizeof(type));
+ append_bytes(&delete_inode_and_dir_entry, sizeof(delete_inode_and_dir_entry));
+ append_bytes(entry_name, delete_inode_and_dir_entry.entry_name_length);
+ return APPENDED;
+ }
+
using to_disk_buffer::flush_to_disk;
};

diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc
index a8b17c2b..752682e4 100644
--- a/src/fs/metadata_log.cc
+++ b/src/fs/metadata_log.cc
@@ -29,6 +29,7 @@
#include "fs/metadata_log_operations/create_and_open_unlinked_file.hh"
#include "fs/metadata_log_operations/create_file.hh"
#include "fs/metadata_log_operations/link_file.hh"
+#include "fs/metadata_log_operations/unlink_or_remove_file.hh"

#include "fs/metadata_to_disk_buffer.hh"
#include "fs/path.hh"
#include "fs/units.hh"

@@ -128,6 +129,18 @@ void metadata_log::memory_only_add_dir_entry(inode_info::directory& dir, inode_t
++it->second.directories_containing_file;
}

+void metadata_log::memory_only_delete_dir_entry(inode_info::directory& dir, std::string entry_name) {
+ auto it = dir.entries.find(entry_name);
+ assert(it != dir.entries.end());
+
+ auto entry_it = _inodes.find(it->second);
+ assert(entry_it != _inodes.end());
+ assert(entry_it->second.is_linked());
+
+ --entry_it->second.directories_containing_file;
+ dir.entries.erase(it);
+}
+
void metadata_log::schedule_flush_of_curr_cluster() {
// Make writes concurrent (TODO: maybe serialized within *one* cluster would be faster?)
schedule_background_task(do_with(_curr_cluster_buff, &_device, [](auto& crr_clstr_bf, auto& device) {

@@ -307,6 +320,18 @@ future<> metadata_log::link_file(std::string source, std::string destination) {
});
}

+future<> metadata_log::unlink_file(std::string path) {
+ return unlink_or_remove_file_operation::perform(*this, std::move(path), remove_semantics::FILE_ONLY);
+}
+
+future<> metadata_log::remove_directory(std::string path) {
+ return unlink_or_remove_file_operation::perform(*this, std::move(path), remove_semantics::DIR_ONLY);
+}
+
+future<> metadata_log::remove(std::string path) {
+ return unlink_or_remove_file_operation::perform(*this, std::move(path), remove_semantics::FILE_OR_DIR);
+}
+
// TODO: think about how to make filesystem recoverable from ENOSPACE situation: flush() (or something else) throws ENOSPACE,
// then it should be possible to compact some data (e.g. by truncating a file) via top-level interface and retrying the flush()
// without a ENOSPACE error. In particular if we delete all files after ENOSPACE it should be successful. It becomes especially

diff --git a/src/fs/metadata_log_bootstrap.cc b/src/fs/metadata_log_bootstrap.cc
index 64396d11..3120fbd4 100644
--- a/src/fs/metadata_log_bootstrap.cc
+++ b/src/fs/metadata_log_bootstrap.cc
@@ -219,6 +219,10 @@ future<> metadata_log_bootstrap::bootstrap_checkpointed_data() {

return bootstrap_add_dir_entry();
case CREATE_INODE_AS_DIR_ENTRY:
return bootstrap_create_inode_as_dir_entry();

+ case DELETE_DIR_ENTRY:
+ return bootstrap_delete_dir_entry();
+ case DELETE_INODE_AND_DIR_ENTRY:
+ return bootstrap_delete_inode_and_dir_entry();

}

// unknown type => metadata log corruption

@@ -340,6 +344,72 @@ future<> metadata_log_bootstrap::bootstrap_create_inode_as_dir_entry() {
return now();
}

+future<> metadata_log_bootstrap::bootstrap_delete_dir_entry() {
+ ondisk_delete_dir_entry_header entry;
+ if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.dir_inode)) {
+ return invalid_entry_exception();
+ }
+
+ std::string dir_entry_name;
+ if (not _curr_checkpoint.read_string(dir_entry_name, entry.entry_name_length)) {
+ return invalid_entry_exception();
+ }
+
+ if (not _metadata_log._inodes[entry.dir_inode].is_directory()) {
+ return invalid_entry_exception();
+ }
+ auto& dir = _metadata_log._inodes[entry.dir_inode].get_directory();
+
+ auto it = dir.entries.find(dir_entry_name);
+ if (it == dir.entries.end()) {
+ return invalid_entry_exception();
+ }
+
+ _metadata_log.memory_only_delete_dir_entry(dir, std::move(dir_entry_name));
+ // TODO: Maybe mtime_ns for modifying directory?
+ return now();
+}
+
+future<> metadata_log_bootstrap::bootstrap_delete_inode_and_dir_entry() {
+ ondisk_delete_inode_and_dir_entry_header entry;
+ if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.dir_inode) or not inode_exists(entry.inode_to_delete)) {
+ return invalid_entry_exception();
+ }
+
+ std::string dir_entry_name;
+ if (not _curr_checkpoint.read_string(dir_entry_name, entry.entry_name_length)) {
+ return invalid_entry_exception();
+ }
+
+ if (not _metadata_log._inodes[entry.dir_inode].is_directory()) {
+ return invalid_entry_exception();
+ }
+ auto& dir = _metadata_log._inodes[entry.dir_inode].get_directory();
+
+ auto it = dir.entries.find(dir_entry_name);
+ if (it == dir.entries.end()) {
+ return invalid_entry_exception();
+ }
+
+ _metadata_log.memory_only_delete_dir_entry(dir, std::move(dir_entry_name));
+ // TODO: Maybe mtime_ns for modifying directory?
+
+ // TODO: there is so much copy & paste here...
+ // TODO: maybe to make ondisk_delete_inode_and_dir_entry_header have ondisk_delete_inode and
+ // ondisk_delete_dir_entry_header to ease deduplicating code?
+ inode_info& inode_to_delete_info = _metadata_log._inodes.at(entry.inode_to_delete);
+ if (inode_to_delete_info.directories_containing_file > 0) {
+ return invalid_entry_exception(); // Only unlinked inodes may be deleted
+ }
+
+ if (inode_to_delete_info.is_directory() and not inode_to_delete_info.get_directory().entries.empty()) {
+ return invalid_entry_exception(); // Only empty directories may be deleted
+ }
+
+ _metadata_log.memory_only_delete_inode(entry.inode_to_delete);
+ return now();
+}
+
future<> metadata_log_bootstrap::bootstrap(metadata_log& metadata_log, inode_t root_dir, cluster_id_t first_metadata_cluster_id,
cluster_range available_clusters, fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id) {
// Clear the metadata log

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 6259742e..e432e572 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -680,6 +680,7 @@ if (Seastar_EXPERIMENTAL_FS)
src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
src/fs/metadata_log_operations/create_file.hh
src/fs/metadata_log_operations/link_file.hh
+ src/fs/metadata_log_operations/unlink_or_remove_file.hh
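
The three `remove_semantics` values in this patch decide which kind of entry an operation is allowed to remove. The fail-fast checks in `unlink_or_remove()` can be summarized as a small pure function, sketched below. The `remove_check` enum and `check_removal()` are hypothetical names used only for this illustration. One deliberate difference: the sketch reports a separate `not_a_directory` outcome when a file is found under `DIR_ONLY`, whereas the patch as posted returns `is_directory_exception` in that branch.

```cpp
#include <cassert>

// Same three modes as in the patch.
enum class remove_semantics { FILE_ONLY, DIR_ONLY, FILE_OR_DIR };

// Hypothetical result type for this sketch; the patch uses exceptions instead.
enum class remove_check { ok, is_directory, not_a_directory, directory_not_empty };

// Decides whether an entry may be removed under the given semantics.
// directory_is_empty is only consulted when the entry is a directory.
inline remove_check check_removal(remove_semantics sem, bool is_directory, bool directory_is_empty) {
    if (is_directory) {
        if (sem == remove_semantics::FILE_ONLY) {
            return remove_check::is_directory;
        }
        // DIR_ONLY and FILE_OR_DIR both require the directory to be empty.
        return directory_is_empty ? remove_check::ok : remove_check::directory_not_empty;
    }
    if (sem == remove_semantics::DIR_ONLY) {
        return remove_check::not_a_directory;
    }
    return remove_check::ok;
}
```

Under this reading, `unlink_file()` corresponds to `FILE_ONLY`, `remove_directory()` to `DIR_ONLY`, and `remove()` to `FILE_OR_DIR`, which matches the header comment "Removes empty directory or unlinks file".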

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:38 AM

to seastar-dev@googlegroups.com, Michał Niciejewski, sarna@scylladb.com, ankezy@gmail.com, wmitros@protonmail.com

From: Michał Niciejewski <qup...@gmail.com>

Adds close_file(), which decrements the opened-files counter. If the
file is unlinked and the counter drops to zero, the file is scheduled
for removal.

Signed-off-by: Michał Niciejewski <qup...@gmail.com>
---

src/fs/metadata_log.hh | 2 ++
src/fs/metadata_log.cc | 27 +++++++++++++++++++++++++++
2 files changed, 29 insertions(+)

diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
index 7bb2c6bc..721e43b8 100644
--- a/src/fs/metadata_log.hh
+++ b/src/fs/metadata_log.hh
@@ -322,6 +322,8 @@ class metadata_log {

// TODO: what about permissions, uid, gid etc.

future<inode_t> open_file(std::string path);

+ future<> close_file(inode_t inode);
+
// All disk-related errors will be exposed here
future<> flush_log() {
return flush_curr_cluster();

diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc
index 00ce88d2..56954cf1 100644
--- a/src/fs/metadata_log.cc
+++ b/src/fs/metadata_log.cc
@@ -354,6 +354,33 @@ future<inode_t> metadata_log::open_file(std::string path) {
});
}

+future<> metadata_log::close_file(inode_t inode) {
+ auto inode_it = _inodes.find(inode);
+ if (inode_it == _inodes.end()) {
+ return make_exception_future(invalid_inode_exception());
+ }
+ inode_info* inode_info = &inode_it->second;
+ if (inode_info->is_directory()) {
+ return make_exception_future(is_directory_exception());
+ }
+
+
+ return _locks.with_lock(metadata_log::locks::shared {inode}, [this, inode, inode_info] {
+ if (not inode_exists(inode)) {
+ return make_exception_future(operation_became_invalid_exception());
+ }
+
+ assert(inode_info->is_open());
+
+ --inode_info->opened_files_count;
+ if (not inode_info->is_linked() and not inode_info->is_open()) {
+ // Unlinked and not open file should be removed
+ schedule_attempt_to_delete_inode(inode);
+ }
+ return now();
+ });
+}
+
// TODO: think about how to make filesystem recoverable from ENOSPACE situation: flush() (or something else) throws ENOSPACE,
// then it should be possible to compact some data (e.g. by truncating a file) via top-level interface and retrying the flush()
// without a ENOSPACE error. In particular if we delete all files after ENOSPACE it should be successful. It becomes especially

--
2.26.1
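
Taken together, open_file(), close_file() and the unlink path keep a file alive while it is either linked from a directory or held open. Below is a minimal, synchronous sketch of that lifetime rule, assuming a plain map in place of the metadata log. `inode_table`, `file_state` and all method names are hypothetical, and deletion happens immediately here, whereas the patches call `schedule_attempt_to_delete_inode()` and take locks.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

using inode_id = std::uint64_t; // assumed width, for illustration only

struct file_state {
    std::uint32_t links = 0;        // directory entries naming the file
    std::uint32_t open_handles = 0; // incremented by open(), decremented by close()
};

class inode_table {
    std::unordered_map<inode_id, file_state> _files;

    // A file is dropped once it is neither linked nor open.
    void delete_if_unreferenced(inode_id id) {
        auto it = _files.find(id);
        if (it != _files.end() && it->second.links == 0 && it->second.open_handles == 0) {
            _files.erase(it);
        }
    }

public:
    void create_linked(inode_id id) { _files[id].links = 1; }

    bool exists(inode_id id) const { return _files.count(id) != 0; }

    bool open(inode_id id) {
        auto it = _files.find(id);
        if (it == _files.end()) {
            return false;
        }
        ++it->second.open_handles;
        return true;
    }

    bool close(inode_id id) {
        auto it = _files.find(id);
        if (it == _files.end() || it->second.open_handles == 0) {
            return false;
        }
        --it->second.open_handles;
        delete_if_unreferenced(id);
        return true;
    }

    bool unlink(inode_id id) {
        auto it = _files.find(id);
        if (it == _files.end() || it->second.links == 0) {
            return false;
        }
        --it->second.links;
        delete_if_unreferenced(id);
        return true;
    }
};
```

Unlinking an open file leaves it in the table until the last close, which is the behavior the two commit messages describe.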

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:41 AM

to seastar-dev@googlegroups.com, Michał Niciejewski, sarna@scylladb.com, ankezy@gmail.com, wmitros@protonmail.com

From: Michał Niciejewski <qup...@gmail.com>

Each write can be divided into multiple smaller writes, each of which
falls into one of the following categories:
- small write: below SMALL_WRITE_THRESHOLD bytes; stored fully in
  memory
- medium write: between SMALL_WRITE_THRESHOLD and cluster_size bytes;
  stored on disk by appending to the on-disk data log, where data from
  different writes can share one cluster
- large write: fills exactly one whole cluster (length == cluster_size);
  stored on disk

For example, one write can be divided into several large writes plus
some small and medium writes. The current implementation avoids
unnecessary data copying: data passed by the caller is either written
directly to disk or copied as a small write.
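
The categories above can be illustrated with a simplified splitter. This is only a sketch under stated assumptions, not the logic in write.hh: it ignores the alignment handling that the real code needs, it treats a remainder exactly equal to the threshold as a medium write (the message does not say which side that case falls on), and `write_params`, `write_piece` and `split_write()` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class write_kind { small, medium, large };

struct write_piece {
    write_kind kind;
    std::size_t length;
};

// Hypothetical parameters; their real values are not given in this message.
struct write_params {
    std::size_t small_write_threshold;
    std::size_t cluster_size; // must be non-zero
};

// Splits a write of `length` bytes into whole-cluster pieces followed by at
// most one remainder piece, classified as small or medium.
inline std::vector<write_piece> split_write(std::size_t length, const write_params& p) {
    std::vector<write_piece> pieces;
    while (length >= p.cluster_size) {
        pieces.push_back({write_kind::large, p.cluster_size});
        length -= p.cluster_size;
    }
    if (length > 0) {
        write_kind kind = length < p.small_write_threshold ? write_kind::small : write_kind::medium;
        pieces.push_back({kind, length});
    }
    return pieces;
}
```

With a 1024-byte cluster and a 128-byte threshold, a 2148-byte write becomes two large pieces and one 100-byte small piece, while a 500-byte write is a single medium piece.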

Adds a cluster writer, which is used to perform medium writes. The
cluster writer keeps the current position in the data log and appends
new data by writing it directly to disk.
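
The position bookkeeping described here can be modelled without Seastar or a block device. The sketch below is an assumption-laden stand-in for cluster_writer: `cluster_cursor` and `reserve()` are hypothetical names, and where the real class asserts on misuse, this version returns an empty optional so the behavior is easy to test.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Tracks where the next append to a data cluster should land on disk.
class cluster_cursor {
    std::size_t _max_size;
    std::size_t _alignment;
    std::uint64_t _cluster_beg_offset;
    std::size_t _next_write_offset = 0;

public:
    // max_size and cluster_beg_offset must be multiples of alignment,
    // and alignment must be a power of two.
    cluster_cursor(std::size_t max_size, std::size_t alignment, std::uint64_t cluster_beg_offset)
        : _max_size(max_size), _alignment(alignment), _cluster_beg_offset(cluster_beg_offset) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        assert(max_size % alignment == 0);
        assert(cluster_beg_offset % alignment == 0);
    }

    std::size_t bytes_left() const noexcept { return _max_size - _next_write_offset; }

    // Reserves aligned_len bytes and returns the disk offset to write them at.
    // The position advances before any I/O would start, so the cursor is
    // immediately usable for the next append.
    std::optional<std::uint64_t> reserve(std::size_t aligned_len) {
        if (aligned_len % _alignment != 0 || aligned_len > bytes_left()) {
            return std::nullopt;
        }
        std::uint64_t disk_offset = _cluster_beg_offset + _next_write_offset;
        _next_write_offset += aligned_len;
        return disk_offset;
    }
};
```

Advancing the offset before issuing the write mirrors the comment in `cluster_writer::write()` about keeping the writer usable before the function returns, which lets several appends be in flight concurrently.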

Signed-off-by: Michał Niciejewski <qup...@gmail.com>
---

src/fs/cluster_writer.hh | 85 +++++++
src/fs/metadata_disk_entries.hh | 41 ++-
src/fs/metadata_log.hh | 16 +-
src/fs/metadata_log_bootstrap.hh | 8 +
src/fs/metadata_log_operations/write.hh | 318 ++++++++++++++++++++++++
src/fs/metadata_to_disk_buffer.hh | 24 ++
src/fs/metadata_log.cc | 67 ++++-
src/fs/metadata_log_bootstrap.cc | 103 ++++++++
CMakeLists.txt | 2 +
9 files changed, 660 insertions(+), 4 deletions(-)
create mode 100644 src/fs/cluster_writer.hh
create mode 100644 src/fs/metadata_log_operations/write.hh

diff --git a/src/fs/cluster_writer.hh b/src/fs/cluster_writer.hh
new file mode 100644
index 00000000..2d2ff917
--- /dev/null
+++ b/src/fs/cluster_writer.hh
@@ -0,0 +1,85 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+

+#include "fs/bitwise.hh"
+#include "fs/units.hh"
+#include "seastar/core/shared_ptr.hh"
+#include "seastar/fs/block_device.hh"
+
+#include <cstdlib>
+#include <cassert>

+
+namespace seastar::fs {
+

+// Represents buffer that will be written to a block_device. Method init() should be called just after constructor
+// in order to finish construction.

+class cluster_writer {
+protected:

+ size_t _max_size = 0;
+ unit_size_t _alignment = 0;
+ disk_offset_t _cluster_beg_offset = 0;

+ size_t _next_write_offset = 0;
+public:
+ cluster_writer() = default;
+
+ virtual shared_ptr<cluster_writer> virtual_constructor() const {
+ return make_shared<cluster_writer>();
+ }

+
+ // Total number of bytes appended cannot exceed @p aligned_max_size.
+ // @p cluster_beg_offset is the disk offset of the beginning of the cluster.
+ virtual void init(size_t aligned_max_size, unit_size_t alignment, disk_offset_t cluster_beg_offset) {
+ assert(is_power_of_2(alignment));
+ assert(mod_by_power_of_2(aligned_max_size, alignment) == 0);
+ assert(mod_by_power_of_2(cluster_beg_offset, alignment) == 0);
+
+ _max_size = aligned_max_size;
+ _alignment = alignment;
+ _cluster_beg_offset = cluster_beg_offset;

+ _next_write_offset = 0;
+ }
+
+ // Writes @p aligned_buffer to @p device just after previous write (or at @p cluster_beg_offset passed to init()
+ // if it is the first write).
+ virtual future<size_t> write(const void* aligned_buffer, size_t aligned_len, block_device device) {
+ assert(reinterpret_cast<uintptr_t>(aligned_buffer) % _alignment == 0);
+ assert(aligned_len % _alignment == 0);
+ assert(aligned_len <= bytes_left());
+
+ // Make sure the writer is usable before returning from this function
+ size_t curr_write_offset = _next_write_offset;
+ _next_write_offset += aligned_len;
+
+ return device.write(_cluster_beg_offset + curr_write_offset, aligned_buffer, aligned_len);
+ }
+
+ virtual size_t bytes_left() const noexcept { return _max_size - _next_write_offset; }
+
+ // Returns disk offset of the place where the first byte of next appended bytes would be after flush
+ // TODO: maybe better name for that function? Or any other method to extract that data?
+ virtual disk_offset_t current_disk_offset() const noexcept {
+ return _cluster_beg_offset + _next_write_offset;
+ }
+};
+
+} // namespace seastar::fs

diff --git a/src/fs/metadata_disk_entries.hh b/src/fs/metadata_disk_entries.hh
index 2f363a9b..4422e0b1 100644
--- a/src/fs/metadata_disk_entries.hh
+++ b/src/fs/metadata_disk_entries.hh
@@ -74,6 +74,10 @@ enum ondisk_type : uint8_t {
NEXT_METADATA_CLUSTER,
CREATE_INODE,
DELETE_INODE,
+ SMALL_WRITE,
+ MEDIUM_WRITE,
+ LARGE_WRITE,
+ LARGE_WRITE_WITHOUT_MTIME,
ADD_DIR_ENTRY,
CREATE_INODE_AS_DIR_ENTRY,
DELETE_DIR_ENTRY,
@@ -111,6 +115,35 @@ struct ondisk_delete_inode {
inode_t inode;
} __attribute__((packed));

+struct ondisk_small_write_header {
+ inode_t inode;
+ file_offset_t offset;
+ uint16_t length;
+ decltype(unix_metadata::mtime_ns) time_ns;
+ // After header comes data
+} __attribute__((packed));
+
+struct ondisk_medium_write {
+ inode_t inode;
+ file_offset_t offset;
+ disk_offset_t disk_offset;
+ uint32_t length;
+ decltype(unix_metadata::mtime_ns) time_ns;
+} __attribute__((packed));
+
+struct ondisk_large_write {
+ inode_t inode;
+ file_offset_t offset;
+ cluster_id_t data_cluster; // length == cluster_size
+ decltype(unix_metadata::mtime_ns) time_ns;
+} __attribute__((packed));
+
+struct ondisk_large_write_without_mtime {
+ inode_t inode;
+ file_offset_t offset;
+ cluster_id_t data_cluster; // length == cluster_size
+} __attribute__((packed));
+
struct ondisk_add_dir_entry_header {
inode_t dir_inode;
inode_t entry_inode;
@@ -142,9 +175,15 @@ template<typename T>

constexpr size_t ondisk_entry_size(const T& entry) noexcept {
static_assert(std::is_same_v<T, ondisk_next_metadata_cluster> or

std::is_same_v<T, ondisk_create_inode> or

- std::is_same_v<T, ondisk_delete_inode>, "ondisk entry size not defined for given type");
+ std::is_same_v<T, ondisk_delete_inode> or
+ std::is_same_v<T, ondisk_medium_write> or
+ std::is_same_v<T, ondisk_large_write> or
+ std::is_same_v<T, ondisk_large_write_without_mtime>, "ondisk entry size not defined for given type");
return sizeof(ondisk_type) + sizeof(entry);
}
+constexpr size_t ondisk_entry_size(const ondisk_small_write_header& entry) noexcept {
+ return sizeof(ondisk_type) + sizeof(entry) + entry.length;
+}
constexpr size_t ondisk_entry_size(const ondisk_add_dir_entry_header& entry) noexcept {

return sizeof(ondisk_type) + sizeof(entry) + entry.entry_name_length;
}

diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
index 721e43b8..36e16280 100644
--- a/src/fs/metadata_log.hh
+++ b/src/fs/metadata_log.hh
@@ -23,6 +23,7 @@

#include "fs/cluster.hh"
#include "fs/cluster_allocator.hh"
+#include "fs/cluster_writer.hh"
#include "fs/inode.hh"
#include "fs/inode_info.hh"
#include "fs/metadata_disk_entries.hh"
@@ -54,6 +55,7 @@ class metadata_log {

// Takes care of writing current cluster of serialized metadata log entries to device

shared_ptr<metadata_to_disk_buffer> _curr_cluster_buff;
+ shared_ptr<cluster_writer> _curr_data_writer;
shared_future<> _background_futures = now();

// In memory metadata
@@ -160,10 +162,11 @@ class metadata_log {
friend class create_file_operation;
friend class link_file_operation;
friend class unlink_or_remove_file_operation;
+ friend class write_operation;

public:
metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment,

- shared_ptr<metadata_to_disk_buffer> cluster_buff);
+ shared_ptr<metadata_to_disk_buffer> cluster_buff, shared_ptr<cluster_writer> data_writer);

metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment);

@@ -181,8 +184,16 @@ class metadata_log {
return _inodes.count(inode) != 0;
}

+ void write_update(inode_info::file& file, inode_data_vec new_data_vec);
+
+ // Deletes data vectors that lie entirely within @p range and trims overlapping data vectors so that none overlaps @p range
+ void cut_out_data_range(inode_info::file& file, file_range range);
+

inode_info& memory_only_create_inode(inode_t inode, bool is_directory, unix_metadata metadata);
void memory_only_delete_inode(inode_t inode);

+ void memory_only_small_write(inode_t inode, disk_offset_t offset, temporary_buffer<uint8_t> data);
+ void memory_only_disk_write(inode_t inode, file_offset_t file_offset, disk_offset_t disk_offset, size_t write_len);
+ void memory_only_update_mtime(inode_t inode, decltype(unix_metadata::mtime_ns) mtime_ns);

void memory_only_add_dir_entry(inode_info::directory& dir, inode_t entry_inode, std::string entry_name);

void memory_only_delete_dir_entry(inode_info::directory& dir, std::string entry_name);

@@ -324,6 +335,9 @@ class metadata_log {

future<> close_file(inode_t inode);

+ future<size_t> write(inode_t inode, file_offset_t pos, const void* buffer, size_t len,
+ const io_priority_class& pc = default_priority_class());

+
// All disk-related errors will be exposed here
future<> flush_log() {
return flush_curr_cluster();

diff --git a/src/fs/metadata_log_bootstrap.hh b/src/fs/metadata_log_bootstrap.hh
index 16b429ab..03c2eb9b 100644
--- a/src/fs/metadata_log_bootstrap.hh
+++ b/src/fs/metadata_log_bootstrap.hh
@@ -119,6 +119,14 @@ class metadata_log_bootstrap {

future<> bootstrap_delete_inode();

+ future<> bootstrap_small_write();
+
+ future<> bootstrap_medium_write();
+
+ future<> bootstrap_large_write();
+
+ future<> bootstrap_large_write_without_mtime();
+
future<> bootstrap_add_dir_entry();

future<> bootstrap_create_inode_as_dir_entry();
diff --git a/src/fs/metadata_log_operations/write.hh b/src/fs/metadata_log_operations/write.hh
new file mode 100644
index 00000000..afe3e2ae
--- /dev/null
+++ b/src/fs/metadata_log_operations/write.hh
@@ -0,0 +1,318 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+

+#include "fs/bitwise.hh"

+#include "fs/inode.hh"
+#include "fs/inode_info.hh"
+#include "fs/metadata_disk_entries.hh"
+#include "fs/metadata_log.hh"

+#include "fs/units.hh"
+#include "fs/cluster.hh"
+#include "seastar/core/future-util.hh"
+#include "seastar/core/future.hh"
+#include "seastar/core/shared_ptr.hh"
+#include "seastar/core/temporary_buffer.hh"

+
+namespace seastar::fs {
+

+class write_operation {
+public:
+ // TODO: decide about threshold for small write
+ static constexpr size_t SMALL_WRITE_THRESHOLD = std::numeric_limits<decltype(ondisk_small_write_header::length)>::max();
+
+private:
+ metadata_log& _metadata_log;
+ inode_t _inode;
+ const io_priority_class& _pc;
+
+ write_operation(metadata_log& metadata_log, inode_t inode, const io_priority_class& pc)
+ : _metadata_log(metadata_log), _inode(inode), _pc(pc) {
+ assert(_metadata_log._alignment <= SMALL_WRITE_THRESHOLD and
+ "Small write threshold should be at least as big as alignment");
+ }
+
+ future<size_t> write(const uint8_t* buffer, size_t write_len, file_offset_t file_offset) {
+ auto inode_it = _metadata_log._inodes.find(_inode);
+ if (inode_it == _metadata_log._inodes.end()) {
+ return make_exception_future<size_t>(invalid_inode_exception());
+ }
+ if (inode_it->second.is_directory()) {
+ return make_exception_future<size_t>(is_directory_exception());
+ }
+
+ // TODO: maybe check if there is enough free clusters before executing?
+ return _metadata_log._locks.with_lock(metadata_log::locks::shared {_inode}, [this, buffer, write_len, file_offset] {
+ if (not _metadata_log.inode_exists(_inode)) {
+ return make_exception_future<size_t>(operation_became_invalid_exception());
+ }
+ return iterate_writes(buffer, write_len, file_offset);
+ });
+ }
+
+ future<size_t> iterate_writes(const uint8_t* buffer, size_t write_len, file_offset_t file_offset) {
+ return do_with((size_t)0, [this, buffer, write_len, file_offset](size_t& completed_write_len) {
+ return repeat([this, &completed_write_len, buffer, write_len, file_offset] {
+ if (completed_write_len == write_len) {
+ return make_ready_future<bool_class<stop_iteration_tag>>(stop_iteration::yes);
+ }
+
+ size_t remaining_write_len = write_len - completed_write_len;
+
+ size_t expected_write_len;
+ if (remaining_write_len <= SMALL_WRITE_THRESHOLD) {
+ expected_write_len = remaining_write_len;
+ } else {
+ if (auto buffer_alignment = mod_by_power_of_2(reinterpret_cast<uintptr_t>(buffer) + completed_write_len,
+ _metadata_log._alignment); buffer_alignment != 0) {
+ // When buffer is not aligned then align it using one small write
+ expected_write_len = _metadata_log._alignment - buffer_alignment;
+ } else {
+ if (remaining_write_len >= _metadata_log._cluster_size) {
+ expected_write_len = _metadata_log._cluster_size;
+ } else {
+ // If the last write is medium then align write length by splitting last write into medium aligned
+ // write and small write
+ expected_write_len = remaining_write_len;
+ }
+ }
+ }
+
+ auto shifted_buffer = buffer + completed_write_len;
+ auto shifted_file_offset = file_offset + completed_write_len;
+ auto write_future = make_ready_future<size_t>(0);
+ if (expected_write_len <= SMALL_WRITE_THRESHOLD) {
+ write_future = do_small_write(shifted_buffer, expected_write_len, shifted_file_offset);
+ } else if (expected_write_len < _metadata_log._cluster_size) {
+ write_future = medium_write(shifted_buffer, expected_write_len, shifted_file_offset);
+ } else {
+ // Update mtime only when it is the first write
+ write_future = do_large_write(shifted_buffer, shifted_file_offset, completed_write_len == 0);
+ }
+
+ return write_future.then([&completed_write_len, expected_write_len](size_t write_len) {
+ completed_write_len += write_len;
+ if (write_len != expected_write_len) {
+ return stop_iteration::yes;
+ }

+ return stop_iteration::no;
+ });

+ }).then([&completed_write_len] {
+ return make_ready_future<size_t>(completed_write_len);
+ });
+ });
+ }
+
+ static decltype(unix_metadata::mtime_ns) get_current_time_ns() {

+ using namespace std::chrono;

+ return duration_cast<nanoseconds>(system_clock::now().time_since_epoch()).count();
+ }
+
+ future<size_t> do_small_write(const uint8_t* buffer, size_t expected_write_len, file_offset_t file_offset) {
+ auto curr_time_ns = get_current_time_ns();
+ ondisk_small_write_header ondisk_entry {
+ _inode,
+ file_offset,
+ static_cast<decltype(ondisk_small_write_header::length)>(expected_write_len),
+ curr_time_ns
+ };
+
+ switch (_metadata_log.append_ondisk_entry(ondisk_entry, buffer)) {

+ case metadata_log::append_result::TOO_BIG:

+ return make_exception_future<size_t>(cluster_size_too_small_to_perform_operation_exception());

+ case metadata_log::append_result::NO_SPACE:

+ return make_exception_future<size_t>(no_more_space_exception());

+ case metadata_log::append_result::APPENDED:

+ temporary_buffer<uint8_t> tmp_buffer(buffer, expected_write_len);
+ _metadata_log.memory_only_small_write(_inode, file_offset, std::move(tmp_buffer));
+ _metadata_log.memory_only_update_mtime(_inode, curr_time_ns);
+ return make_ready_future<size_t>(expected_write_len);

+ }
+ __builtin_unreachable();
+ }
+

+ future<size_t> medium_write(const uint8_t* aligned_buffer, size_t expected_write_len, file_offset_t file_offset) {
+ assert(reinterpret_cast<uintptr_t>(aligned_buffer) % _metadata_log._alignment == 0);
+ // TODO: a medium write can be split into a large number of smaller writes. Maybe we should check
+ // for that and allow only a limited number of medium writes? Or we could add a space 'reservation'
+ // option to to_disk_buffer to make sure that after splitting our write will fit into the buffer?
+ // That would also limit a medium write to at most two smaller writes.
+ return do_with((size_t)0, [this, aligned_buffer, expected_write_len, file_offset](size_t& completed_write_len) {
+ return repeat([this, &completed_write_len, aligned_buffer, expected_write_len, file_offset] {
+ if (completed_write_len == expected_write_len) {
+ return make_ready_future<bool_class<stop_iteration_tag>>(stop_iteration::yes);
+ }
+
+ size_t remaining_write_len = expected_write_len - completed_write_len;
+ size_t curr_expected_write_len;
+ auto shifted_buffer = aligned_buffer + completed_write_len;
+ auto shifted_file_offset = file_offset + completed_write_len;
+ auto write_future = make_ready_future<size_t>(0);
+ if (remaining_write_len <= SMALL_WRITE_THRESHOLD) {
+ // We can use small write for the remaining data
+ curr_expected_write_len = remaining_write_len;
+ write_future = do_small_write(shifted_buffer, curr_expected_write_len, shifted_file_offset);
+ } else {
+ size_t rounded_remaining_write_len =
+ round_down_to_multiple_of_power_of_2(remaining_write_len, _metadata_log._alignment);
+
+ // We must use medium write
+ size_t buff_bytes_left = _metadata_log._curr_data_writer->bytes_left();
+ if (buff_bytes_left <= SMALL_WRITE_THRESHOLD) {
+ // TODO: add wasted buff_bytes_left bytes for compaction
+ // No space left in the current to_disk_buffer for medium write - allocate a new buffer
+ std::optional<cluster_id_t> cluster_opt = _metadata_log._cluster_allocator.alloc();
+ if (not cluster_opt) {
+ // TODO: maybe we should return partial write instead of exception?
+ return make_exception_future<bool_class<stop_iteration_tag>>(no_more_space_exception());
+ }
+
+ auto cluster_id = cluster_opt.value();
+ disk_offset_t cluster_disk_offset = cluster_id_to_offset(cluster_id, _metadata_log._cluster_size);
+ _metadata_log._curr_data_writer = _metadata_log._curr_data_writer->virtual_constructor();
+ _metadata_log._curr_data_writer->init(_metadata_log._cluster_size, _metadata_log._alignment,
+ cluster_disk_offset);
+ buff_bytes_left = _metadata_log._curr_data_writer->bytes_left();
+
+ curr_expected_write_len = rounded_remaining_write_len;
+ } else {
+ // There is enough space for medium write
+ curr_expected_write_len = buff_bytes_left >= rounded_remaining_write_len ?
+ rounded_remaining_write_len : buff_bytes_left;
+ }
+
+ write_future = do_medium_write(shifted_buffer, curr_expected_write_len, shifted_file_offset,
+ _metadata_log._curr_data_writer);
+ }
+
+ return write_future.then([&completed_write_len, curr_expected_write_len](size_t write_len) {
+ completed_write_len += write_len;
+ if (write_len != curr_expected_write_len) {
+ return stop_iteration::yes;
+ }

+ return stop_iteration::no;
+ });

+ }).then([&completed_write_len] {
+ return make_ready_future<size_t>(completed_write_len);
+ });
+ });
+ }
+
+ future<size_t> do_medium_write(const uint8_t* aligned_buffer, size_t aligned_expected_write_len, file_offset_t file_offset,
+ shared_ptr<cluster_writer> disk_buffer) {
+ assert(reinterpret_cast<uintptr_t>(aligned_buffer) % _metadata_log._alignment == 0);
+ assert(aligned_expected_write_len % _metadata_log._alignment == 0);
+ assert(disk_buffer->bytes_left() >= aligned_expected_write_len);
+
+ disk_offset_t device_offset = disk_buffer->current_disk_offset();
+ return disk_buffer->write(aligned_buffer, aligned_expected_write_len, _metadata_log._device).then(
+ [this, file_offset, disk_buffer = std::move(disk_buffer), device_offset](size_t write_len) {
+ // TODO: is this round down necessary?
+ // On partial write return aligned write length
+ write_len = round_down_to_multiple_of_power_of_2(write_len, _metadata_log._alignment);
+
+ auto curr_time_ns = get_current_time_ns();
+ ondisk_medium_write ondisk_entry {
+ _inode,
+ file_offset,
+ device_offset,
+ static_cast<decltype(ondisk_medium_write::length)>(write_len),
+ curr_time_ns
+ };
+
+ switch (_metadata_log.append_ondisk_entry(ondisk_entry)) {

+ case metadata_log::append_result::TOO_BIG:

+ return make_exception_future<size_t>(cluster_size_too_small_to_perform_operation_exception());

+ case metadata_log::append_result::NO_SPACE:

+ return make_exception_future<size_t>(no_more_space_exception());

+ case metadata_log::append_result::APPENDED:

+ _metadata_log.memory_only_disk_write(_inode, file_offset, device_offset, write_len);
+ _metadata_log.memory_only_update_mtime(_inode, curr_time_ns);
+ return make_ready_future<size_t>(write_len);
+ }
+ __builtin_unreachable();
+ });
+ }
+
+ future<size_t> do_large_write(const uint8_t* aligned_buffer, file_offset_t file_offset, bool update_mtime) {
+ assert(reinterpret_cast<uintptr_t>(aligned_buffer) % _metadata_log._alignment == 0);
+ // aligned_expected_write_len = _metadata_log._cluster_size
+ std::optional<cluster_id_t> cluster_opt = _metadata_log._cluster_allocator.alloc();
+ if (not cluster_opt) {
+ return make_exception_future<size_t>(no_more_space_exception());
+ }
+ auto cluster_id = cluster_opt.value();
+ disk_offset_t cluster_disk_offset = cluster_id_to_offset(cluster_id, _metadata_log._cluster_size);
+
+ return _metadata_log._device.write(cluster_disk_offset, aligned_buffer, _metadata_log._cluster_size, _pc).then(
+ [this, file_offset, cluster_id, cluster_disk_offset, update_mtime](size_t write_len) {
+ if (write_len != _metadata_log._cluster_size) {
+ _metadata_log._cluster_allocator.free(cluster_id);
+ return make_ready_future<size_t>(0);
+ }
+
+ metadata_log::append_result append_result;
+ if (update_mtime) {
+ auto curr_time_ns = get_current_time_ns();
+ ondisk_large_write ondisk_entry {
+ _inode,
+ file_offset,
+ cluster_id,
+ curr_time_ns
+ };
+ append_result = _metadata_log.append_ondisk_entry(ondisk_entry);
+ if (append_result == metadata_log::append_result::APPENDED) {
+ _metadata_log.memory_only_update_mtime(_inode, curr_time_ns);
+ }
+ } else {
+ ondisk_large_write_without_mtime ondisk_entry {
+ _inode,
+ file_offset,
+ cluster_id
+ };
+ append_result = _metadata_log.append_ondisk_entry(ondisk_entry);
+ }
+
+ switch (append_result) {

+ case metadata_log::append_result::TOO_BIG:

+ return make_exception_future<size_t>(cluster_size_too_small_to_perform_operation_exception());

+ case metadata_log::append_result::NO_SPACE:

+ _metadata_log._cluster_allocator.free(cluster_id);
+ return make_exception_future<size_t>(no_more_space_exception());

+ case metadata_log::append_result::APPENDED:

+ _metadata_log.memory_only_disk_write(_inode, file_offset, cluster_disk_offset, write_len);
+ return make_ready_future<size_t>(write_len);

+ }
+ __builtin_unreachable();
+ });

+ }
+
+public:
+ static future<size_t> perform(metadata_log& metadata_log, inode_t inode, file_offset_t pos, const void* buffer,
+ size_t len, const io_priority_class& pc) {
+ return do_with(write_operation(metadata_log, inode, pc), [buffer, len, pos](auto& obj) {
+ return obj.write(static_cast<const uint8_t*>(buffer), len, pos);

+ });
+ }
+};
+
+} // namespace seastar::fs
diff --git a/src/fs/metadata_to_disk_buffer.hh b/src/fs/metadata_to_disk_buffer.hh

index 979a03c2..6a71d96e 100644
--- a/src/fs/metadata_to_disk_buffer.hh
+++ b/src/fs/metadata_to_disk_buffer.hh
@@ -161,6 +161,30 @@ class metadata_to_disk_buffer : protected to_disk_buffer {
return append_simple(DELETE_INODE, delete_inode);
}

+ [[nodiscard]] virtual append_result append(const ondisk_small_write_header& small_write, const void* data) noexcept {
+ ondisk_type type = SMALL_WRITE;
+ if (not fits_for_append(ondisk_entry_size(small_write))) {

+ return TOO_BIG;
+ }
+
+ append_bytes(&type, sizeof(type));

+ append_bytes(&small_write, sizeof(small_write));
+ append_bytes(data, small_write.length);

+ return APPENDED;
+ }
+

+ [[nodiscard]] virtual append_result append(const ondisk_medium_write& medium_write) noexcept {
+ return append_simple(MEDIUM_WRITE, medium_write);
+ }
+
+ [[nodiscard]] virtual append_result append(const ondisk_large_write& large_write) noexcept {
+ return append_simple(LARGE_WRITE, large_write);
+ }
+
+ [[nodiscard]] virtual append_result append(const ondisk_large_write_without_mtime& large_write_without_mtime) noexcept {
+ return append_simple(LARGE_WRITE_WITHOUT_MTIME, large_write_without_mtime);
+ }
+

[[nodiscard]] virtual append_result append(const ondisk_add_dir_entry_header& add_dir_entry, const void* entry_name) noexcept {

ondisk_type type = ADD_DIR_ENTRY;
if (not fits_for_append(ondisk_entry_size(add_dir_entry))) {
diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc
index 56954cf1..70434a4a 100644
--- a/src/fs/metadata_log.cc
+++ b/src/fs/metadata_log.cc
@@ -30,6 +30,7 @@
#include "fs/metadata_log_operations/create_file.hh"
#include "fs/metadata_log_operations/link_file.hh"
#include "fs/metadata_log_operations/unlink_or_remove_file.hh"
+#include "fs/metadata_log_operations/write.hh"

#include "fs/metadata_to_disk_buffer.hh"
#include "fs/path.hh"
#include "fs/units.hh"

@@ -57,11 +58,12 @@
namespace seastar::fs {

metadata_log::metadata_log(block_device device, uint32_t cluster_size, uint32_t alignment,

- shared_ptr<metadata_to_disk_buffer> cluster_buff)
+ shared_ptr<metadata_to_disk_buffer> cluster_buff, shared_ptr<cluster_writer> data_writer)
: _device(std::move(device))
, _cluster_size(cluster_size)
, _alignment(alignment)
, _curr_cluster_buff(std::move(cluster_buff))
+, _curr_data_writer(std::move(data_writer))
, _cluster_allocator({}, {})
, _inode_allocator(1, 0) {
assert(is_power_of_2(alignment));
@@ -70,7 +72,7 @@ metadata_log::metadata_log(block_device device, uint32_t cluster_size, uint32_t

metadata_log::metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment)
: metadata_log(std::move(device), cluster_size, alignment,
- make_shared<metadata_to_disk_buffer>()) {}
+ make_shared<metadata_to_disk_buffer>(), make_shared<cluster_writer>()) {}

future<> metadata_log::bootstrap(inode_t root_dir, cluster_id_t first_metadata_cluster_id, cluster_range available_clusters,
fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id) {
@@ -84,6 +86,27 @@ future<> metadata_log::shutdown() {
});
}

+void metadata_log::write_update(inode_info::file& file, inode_data_vec new_data_vec) {

+ // TODO: for compaction: update used inode_data_vec

+ auto file_size = file.size();
+ if (file_size < new_data_vec.data_range.beg) {
+ file.data.emplace(file_size, inode_data_vec {
+ {file_size, new_data_vec.data_range.beg},
+ inode_data_vec::hole_data {}
+ });
+ } else {
+ cut_out_data_range(file, new_data_vec.data_range);
+ }
+
+ file.data.emplace(new_data_vec.data_range.beg, std::move(new_data_vec));
+}
+
+void metadata_log::cut_out_data_range(inode_info::file& file, file_range range) {
+ file.cut_out_data_range(range, [](inode_data_vec data_vec) {
+ (void)data_vec; // TODO: for compaction: update used inode_data_vec
+ });
+}
+
inode_info& metadata_log::memory_only_create_inode(inode_t inode, bool is_directory, unix_metadata metadata) {
assert(_inodes.count(inode) == 0);
return _inodes.emplace(inode, inode_info {
@@ -118,6 +141,41 @@ void metadata_log::memory_only_delete_inode(inode_t inode) {
_inodes.erase(it);
}

+void metadata_log::memory_only_small_write(inode_t inode, file_offset_t file_offset, temporary_buffer<uint8_t> data) {
+ inode_data_vec data_vec = {
+ {file_offset, file_offset + data.size()},
+ inode_data_vec::in_mem_data {std::move(data)}
+ };
+

+ auto it = _inodes.find(inode);
+ assert(it != _inodes.end());

+ assert(it->second.is_file());
+ write_update(it->second.get_file(), std::move(data_vec));
+}
+
+void metadata_log::memory_only_disk_write(inode_t inode, file_offset_t file_offset, disk_offset_t disk_offset,
+ size_t write_len) {
+ inode_data_vec data_vec = {
+ {file_offset, file_offset + write_len},
+ inode_data_vec::on_disk_data {disk_offset}
+ };
+

+ auto it = _inodes.find(inode);
+ assert(it != _inodes.end());

+ assert(it->second.is_file());
+ write_update(it->second.get_file(), std::move(data_vec));
+}
+
+void metadata_log::memory_only_update_mtime(inode_t inode, decltype(unix_metadata::mtime_ns) mtime_ns) {

+ auto it = _inodes.find(inode);
+ assert(it != _inodes.end());

+ it->second.metadata.mtime_ns = mtime_ns;
+ // ctime should be updated when the contents are modified
+ if (it->second.metadata.ctime_ns < mtime_ns) {
+ it->second.metadata.ctime_ns = mtime_ns;
+ }

+}
+
void metadata_log::memory_only_add_dir_entry(inode_info::directory& dir, inode_t entry_inode, std::string entry_name) {
auto it = _inodes.find(entry_inode);
assert(it != _inodes.end());

@@ -381,6 +439,11 @@ future<> metadata_log::close_file(inode_t inode) {
});
}

+future<size_t> metadata_log::write(inode_t inode, file_offset_t pos, const void* buffer, size_t len,
+ const io_priority_class& pc) {
+ return write_operation::perform(*this, inode, pos, buffer, len, pc);

+}
+
// TODO: think about how to make filesystem recoverable from ENOSPACE situation: flush() (or something else) throws ENOSPACE,
// then it should be possible to compact some data (e.g. by truncating a file) via top-level interface and retrying the flush()
// without a ENOSPACE error. In particular if we delete all files after ENOSPACE it should be successful. It becomes especially

diff --git a/src/fs/metadata_log_bootstrap.cc b/src/fs/metadata_log_bootstrap.cc
index 3120fbd4..52354181 100644
--- a/src/fs/metadata_log_bootstrap.cc
+++ b/src/fs/metadata_log_bootstrap.cc
@@ -111,8 +111,13 @@ future<> metadata_log_bootstrap::bootstrap(cluster_id_t first_metadata_cluster_i
if (free_clusters.empty()) {
return make_exception_future(no_more_space_exception());
}
+ cluster_id_t datalog_cluster_id = free_clusters.front();
free_clusters.pop_front();

+ _metadata_log._curr_data_writer = _metadata_log._curr_data_writer->virtual_constructor();
+ _metadata_log._curr_data_writer->init(_metadata_log._cluster_size, _metadata_log._alignment,
+ cluster_id_to_offset(datalog_cluster_id, _metadata_log._cluster_size));

+
mlogger.debug("free clusters: {}", free_clusters.size());

_metadata_log._cluster_allocator = cluster_allocator(std::move(_taken_clusters), std::move(free_clusters));

@@ -215,6 +220,14 @@ future<> metadata_log_bootstrap::bootstrap_checkpointed_data() {

return bootstrap_create_inode();
case DELETE_INODE:
return bootstrap_delete_inode();

+ case SMALL_WRITE:
+ return bootstrap_small_write();
+ case MEDIUM_WRITE:
+ return bootstrap_medium_write();
+ case LARGE_WRITE:
+ return bootstrap_large_write();
+ case LARGE_WRITE_WITHOUT_MTIME:
+ return bootstrap_large_write_without_mtime();
case ADD_DIR_ENTRY:
return bootstrap_add_dir_entry();
case CREATE_INODE_AS_DIR_ENTRY:
@@ -284,6 +297,96 @@ future<> metadata_log_bootstrap::bootstrap_delete_inode() {
return now();
}

+future<> metadata_log_bootstrap::bootstrap_small_write() {
+ ondisk_small_write_header entry;
+ if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.inode)) {

+ return invalid_entry_exception();
+ }
+

+ if (not _metadata_log._inodes[entry.inode].is_file()) {

+ return invalid_entry_exception();
+ }
+

+ auto data_opt = _curr_checkpoint.read_tmp_buff(entry.length);
+ if (not data_opt) {
+ return invalid_entry_exception();
+ }
+ temporary_buffer<uint8_t>& data = *data_opt;
+
+ _metadata_log.memory_only_small_write(entry.inode, entry.offset, std::move(data));
+ _metadata_log.memory_only_update_mtime(entry.inode, entry.time_ns);

+ return now();
+}
+

+future<> metadata_log_bootstrap::bootstrap_medium_write() {
+ ondisk_medium_write entry;
+ if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.inode)) {

+ return invalid_entry_exception();
+ }
+

+ if (not _metadata_log._inodes[entry.inode].is_file()) {

+ return invalid_entry_exception();
+ }
+

+ cluster_id_t data_cluster_id = offset_to_cluster_id(entry.disk_offset, _metadata_log._cluster_size);
+ if (_available_clusters.beg > data_cluster_id or
+ _available_clusters.end <= data_cluster_id) {
+ return invalid_entry_exception();
+ }
+ // TODO: we could check overlapping with other writes
+ _taken_clusters.emplace(data_cluster_id);
+
+ _metadata_log.memory_only_disk_write(entry.inode, entry.offset, entry.disk_offset, entry.length);
+ _metadata_log.memory_only_update_mtime(entry.inode, entry.time_ns);

+ return now();
+}
+

+future<> metadata_log_bootstrap::bootstrap_large_write() {
+ ondisk_large_write entry;
+ if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.inode)) {

+ return invalid_entry_exception();
+ }
+

+ if (not _metadata_log._inodes[entry.inode].is_file()) {

+ return invalid_entry_exception();
+ }
+

+ if (_available_clusters.beg > entry.data_cluster or
+ _available_clusters.end <= entry.data_cluster or
+ _taken_clusters.count(entry.data_cluster) != 0) {
+ return invalid_entry_exception();
+ }
+ _taken_clusters.emplace((cluster_id_t)entry.data_cluster);
+
+ _metadata_log.memory_only_disk_write(entry.inode, entry.offset,
+ cluster_id_to_offset(entry.data_cluster, _metadata_log._cluster_size), _metadata_log._cluster_size);
+ _metadata_log.memory_only_update_mtime(entry.inode, entry.time_ns);

+ return now();
+}
+

+// TODO: copy pasting :(
+future<> metadata_log_bootstrap::bootstrap_large_write_without_mtime() {
+ ondisk_large_write_without_mtime entry;
+ if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.inode)) {

+ return invalid_entry_exception();
+ }
+

+ if (not _metadata_log._inodes[entry.inode].is_file()) {

+ return invalid_entry_exception();
+ }
+

+ if (_available_clusters.beg > entry.data_cluster or
+ _available_clusters.end <= entry.data_cluster or
+ _taken_clusters.count(entry.data_cluster) != 0) {
+ return invalid_entry_exception();
+ }
+ _taken_clusters.emplace((cluster_id_t)entry.data_cluster);
+
+ _metadata_log.memory_only_disk_write(entry.inode, entry.offset,
+ cluster_id_to_offset(entry.data_cluster, _metadata_log._cluster_size), _metadata_log._cluster_size);

+ return now();
+}
+

future<> metadata_log_bootstrap::bootstrap_add_dir_entry() {
ondisk_add_dir_entry_header entry;

if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.dir_inode) or

diff --git a/CMakeLists.txt b/CMakeLists.txt
index e432e572..840a02aa 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -668,6 +668,7 @@ if (Seastar_EXPERIMENTAL_FS)
src/fs/cluster.hh
src/fs/cluster_allocator.cc
src/fs/cluster_allocator.hh
+ src/fs/cluster_writer.hh
src/fs/crc.hh
src/fs/file.cc
src/fs/inode.hh
@@ -681,6 +682,7 @@ if (Seastar_EXPERIMENTAL_FS)
src/fs/metadata_log_operations/create_file.hh
src/fs/metadata_log_operations/link_file.hh
src/fs/metadata_log_operations/unlink_or_remove_file.hh
+ src/fs/metadata_log_operations/write.hh
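
For readers following the write path in this patch: iterate_writes() in write.hh picks the length of each step and then dispatches on that length. Below is a minimal standalone sketch of that decision only, under the assumption that it can be modelled as a pure function. The names write_kind, write_step and next_write_step are hypothetical and do not exist in the patch; small_threshold stands in for SMALL_WRITE_THRESHOLD.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical model of the step selection in write_operation::iterate_writes().
// None of these names are part of SeastarFS.
enum class write_kind { small, medium, large };

struct write_step {
    write_kind kind;
    std::size_t len;
};

// buffer_addr: address of the first byte not written yet
// remaining:   number of bytes not written yet (must be > 0)
// alignment:   device alignment, a power of 2 not larger than small_threshold
inline write_step next_write_step(std::uintptr_t buffer_addr, std::size_t remaining,
        std::size_t alignment, std::size_t cluster_size,
        std::size_t small_threshold = std::numeric_limits<std::uint16_t>::max()) {
    assert(remaining > 0);
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    assert(alignment <= small_threshold);

    std::size_t len;
    if (remaining <= small_threshold) {
        // Everything that is left fits into one small write
        len = remaining;
    } else if (std::size_t misalignment = buffer_addr & (alignment - 1); misalignment != 0) {
        // Unaligned buffer: one small write brings it to the next aligned address
        len = alignment - misalignment;
    } else if (remaining >= cluster_size) {
        // A whole cluster can be written directly
        len = cluster_size;
    } else {
        // Less than a cluster is left: hand all of it to the medium write path
        len = remaining;
    }

    // Dispatch on the chosen length, in the same order as the patch does
    write_kind kind = write_kind::large;
    if (len <= small_threshold) {
        kind = write_kind::small;
    } else if (len < cluster_size) {
        kind = write_kind::medium;
    }
    return {kind, len};
}
```

With a 512-byte alignment and 1 MiB clusters, a 200000-byte request starting 8 bytes past an aligned address would first get a 504-byte small step; the following step is then aligned and goes to the medium path.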

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:41 AM

to seastar-dev@googlegroups.com, Wojciech Mitros, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com

From: Wojciech Mitros <wmi...@protonmail.com>

Truncate changes the size of a file. When the new size is smaller
than the current one, the data at higher offsets is lost; when it is
larger, the file is extended with null bytes.
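
A minimal sketch of these semantics on a toy in-memory model, assuming the file is a contiguous list of extents and that a "hole" extent reads back as zeros. The types toy_extent and toy_file are hypothetical and are not the inode_info::file or inode_data_vec types used by the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <utility>
#include <vector>

// Hypothetical toy model, not SeastarFS code.
struct toy_extent {
    std::uint64_t end;               // exclusive end offset of the extent
    std::vector<std::uint8_t> bytes; // empty means "hole": reads as zeros
};

struct toy_file {
    // Keyed by begin offset; extents are contiguous and start at offset 0
    std::map<std::uint64_t, toy_extent> extents;

    std::uint64_t size() const {
        return extents.empty() ? 0 : std::prev(extents.end())->second.end;
    }

    void append(std::vector<std::uint8_t> bytes) {
        if (bytes.empty()) {
            return;
        }
        std::uint64_t beg = size();
        std::uint64_t end = beg + bytes.size();
        extents.emplace(beg, toy_extent{end, std::move(bytes)});
    }

    void truncate(std::uint64_t new_size) {
        std::uint64_t old_size = size();
        if (new_size > old_size) {
            // Growing: cover [old_size, new_size) with a hole
            extents.emplace(old_size, toy_extent{new_size, {}});
            return;
        }
        // Shrinking: drop every extent that begins at or after new_size
        extents.erase(extents.lower_bound(new_size), extents.end());
        if (extents.empty()) {
            return;
        }
        // Trim the extent that straddles new_size, if there is one
        auto last = std::prev(extents.end());
        if (last->second.end > new_size) {
            if (!last->second.bytes.empty()) {
                last->second.bytes.resize(new_size - last->first);
            }
            last->second.end = new_size;
        }
    }

    std::uint8_t read_byte(std::uint64_t pos) const {
        assert(pos < size());
        // Extent with the greatest begin offset that is <= pos
        auto it = std::prev(extents.upper_bound(pos));
        const toy_extent& e = it->second;
        return e.bytes.empty() ? 0 : e.bytes[pos - it->first];
    }
};
```

In this model, shrinking a 4-byte file to 2 bytes and then growing it to 5 keeps the first two bytes and makes offsets 2 to 4 read as zero; the bytes cut off by the first truncate do not come back.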

Signed-off-by: Wojciech Mitros <wmi...@protonmail.com>
---

src/fs/metadata_disk_entries.hh | 10 ++-
src/fs/metadata_log.hh | 5 ++
src/fs/metadata_log_bootstrap.hh | 2 +
src/fs/metadata_log_operations/truncate.hh | 90 ++++++++++++++++++++++
src/fs/metadata_to_disk_buffer.hh | 4 +
src/fs/metadata_log.cc | 26 +++++++
src/fs/metadata_log_bootstrap.cc | 17 ++++
CMakeLists.txt | 1 +
8 files changed, 154 insertions(+), 1 deletion(-)
create mode 100644 src/fs/metadata_log_operations/truncate.hh

diff --git a/src/fs/metadata_disk_entries.hh b/src/fs/metadata_disk_entries.hh
index 4422e0b1..8c9f0499 100644
--- a/src/fs/metadata_disk_entries.hh
+++ b/src/fs/metadata_disk_entries.hh
@@ -78,6 +78,7 @@ enum ondisk_type : uint8_t {
MEDIUM_WRITE,
LARGE_WRITE,
LARGE_WRITE_WITHOUT_MTIME,
+ TRUNCATE,
ADD_DIR_ENTRY,
CREATE_INODE_AS_DIR_ENTRY,
DELETE_DIR_ENTRY,
@@ -144,6 +145,12 @@ struct ondisk_large_write_without_mtime {

cluster_id_t data_cluster; // length == cluster_size

} __attribute__((packed));

+struct ondisk_truncate {
+ inode_t inode;
+ file_offset_t size;

+ decltype(unix_metadata::mtime_ns) time_ns;
+} __attribute__((packed));
+

struct ondisk_add_dir_entry_header {
inode_t dir_inode;
inode_t entry_inode;

@@ -178,7 +185,8 @@ constexpr size_t ondisk_entry_size(const T& entry) noexcept {

std::is_same_v<T, ondisk_delete_inode> or

std::is_same_v<T, ondisk_medium_write> or

std::is_same_v<T, ondisk_large_write> or

- std::is_same_v<T, ondisk_large_write_without_mtime>, "ondisk entry size not defined for given type");
+ std::is_same_v<T, ondisk_large_write_without_mtime> or
+ std::is_same_v<T, ondisk_truncate>, "ondisk entry size not defined for given type");
return sizeof(ondisk_type) + sizeof(entry);
}

constexpr size_t ondisk_entry_size(const ondisk_small_write_header& entry) noexcept {

diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
index 36e16280..1ee29842 100644
--- a/src/fs/metadata_log.hh
+++ b/src/fs/metadata_log.hh
@@ -161,6 +161,7 @@ class metadata_log {
friend class create_and_open_unlinked_file_operation;

friend class create_file_operation;
friend class link_file_operation;

+ friend class truncate_operation;
friend class unlink_or_remove_file_operation;
friend class write_operation;

@@ -194,6 +195,7 @@ class metadata_log {

void memory_only_small_write(inode_t inode, disk_offset_t offset, temporary_buffer<uint8_t> data);

void memory_only_disk_write(inode_t inode, file_offset_t file_offset, disk_offset_t disk_offset, size_t write_len);

void memory_only_update_mtime(inode_t inode, decltype(unix_metadata::mtime_ns) mtime_ns);

+ void memory_only_truncate(inode_t inode, file_offset_t size);

void memory_only_add_dir_entry(inode_info::directory& dir, inode_t entry_inode, std::string entry_name);
void memory_only_delete_dir_entry(inode_info::directory& dir, std::string entry_name);

@@ -338,6 +340,9 @@ class metadata_log {

future<size_t> write(inode_t inode, file_offset_t pos, const void* buffer, size_t len,

const io_priority_class& pc = default_priority_class());

+ // Truncates a file or extends it with a "hole" data_vec to a specified size
+ future<> truncate(inode_t inode, file_offset_t size);

+
// All disk-related errors will be exposed here
future<> flush_log() {
return flush_curr_cluster();
diff --git a/src/fs/metadata_log_bootstrap.hh b/src/fs/metadata_log_bootstrap.hh

index 03c2eb9b..5c3584da 100644
--- a/src/fs/metadata_log_bootstrap.hh
+++ b/src/fs/metadata_log_bootstrap.hh
@@ -127,6 +127,8 @@ class metadata_log_bootstrap {

future<> bootstrap_large_write_without_mtime();

+ future<> bootstrap_truncate();

+
future<> bootstrap_add_dir_entry();

future<> bootstrap_create_inode_as_dir_entry();

diff --git a/src/fs/metadata_log_operations/truncate.hh b/src/fs/metadata_log_operations/truncate.hh
new file mode 100644
index 00000000..abd7d158
--- /dev/null
+++ b/src/fs/metadata_log_operations/truncate.hh
@@ -0,0 +1,90 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+

+#include "fs/inode.hh"
+#include "fs/inode_info.hh"
+#include "fs/metadata_disk_entries.hh"
+#include "fs/metadata_log.hh"
+#include "fs/units.hh"

+#include "seastar/core/future-util.hh"
+#include "seastar/core/future.hh"

+
+namespace seastar::fs {
+

+class truncate_operation {
+

+ metadata_log& _metadata_log;
+ inode_t _inode;
+

+ truncate_operation(metadata_log& metadata_log, inode_t inode)
+ : _metadata_log(metadata_log), _inode(inode) {
+ }
+
+ future<> truncate(file_offset_t size) {

+ auto inode_it = _metadata_log._inodes.find(_inode);
+ if (inode_it == _metadata_log._inodes.end()) {

+ return make_exception_future(invalid_inode_exception());

+ }
+ if (inode_it->second.is_directory()) {

+ return make_exception_future(is_directory_exception());
+ }
+
+ return _metadata_log._locks.with_lock(metadata_log::locks::shared {_inode}, [this, size] {
+ if (not _metadata_log.inode_exists(_inode)) {
+ return make_exception_future(operation_became_invalid_exception());
+ }
+ return do_truncate(size);
+ });
+ }
+
+ future<> do_truncate(file_offset_t size) {

+ using namespace std::chrono;

+ uint64_t curr_time_ns = duration_cast<nanoseconds>(system_clock::now().time_since_epoch()).count();
+ ondisk_truncate ondisk_entry {
+ _inode,
+ size,

+ curr_time_ns
+ };
+
+ switch (_metadata_log.append_ondisk_entry(ondisk_entry)) {
+ case metadata_log::append_result::TOO_BIG:

+ return make_exception_future(cluster_size_too_small_to_perform_operation_exception());

+ case metadata_log::append_result::NO_SPACE:

+ return make_exception_future(no_more_space_exception());

+ case metadata_log::append_result::APPENDED:

+ _metadata_log.memory_only_truncate(_inode, size);
+ _metadata_log.memory_only_update_mtime(_inode, curr_time_ns);
+ return make_ready_future();

+ }
+ __builtin_unreachable();
+ }
+

+public:
+ static future<> perform(metadata_log& metadata_log, inode_t inode, file_offset_t size) {
+ return do_with(truncate_operation(metadata_log, inode), [size](auto& obj) {
+ return obj.truncate(size);

+ });
+ }
+};
+
+} // namespace seastar::fs
diff --git a/src/fs/metadata_to_disk_buffer.hh b/src/fs/metadata_to_disk_buffer.hh

index 6a71d96e..714d9f59 100644
--- a/src/fs/metadata_to_disk_buffer.hh
+++ b/src/fs/metadata_to_disk_buffer.hh
@@ -185,6 +185,10 @@ class metadata_to_disk_buffer : protected to_disk_buffer {
return append_simple(LARGE_WRITE_WITHOUT_MTIME, large_write_without_mtime);
}

+ [[nodiscard]] virtual append_result append(const ondisk_truncate& truncate) noexcept {
+ return append_simple(TRUNCATE, truncate);

+ }
+
[[nodiscard]] virtual append_result append(const ondisk_add_dir_entry_header& add_dir_entry, const void* entry_name) noexcept {
ondisk_type type = ADD_DIR_ENTRY;
if (not fits_for_append(ondisk_entry_size(add_dir_entry))) {
diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc

index 70434a4a..18f52dfc 100644
--- a/src/fs/metadata_log.cc
+++ b/src/fs/metadata_log.cc

@@ -29,6 +29,7 @@
#include "fs/metadata_log_operations/create_and_open_unlinked_file.hh"
#include "fs/metadata_log_operations/create_file.hh"
#include "fs/metadata_log_operations/link_file.hh"

+#include "fs/metadata_log_operations/truncate.hh"
#include "fs/metadata_log_operations/unlink_or_remove_file.hh"
#include "fs/metadata_log_operations/write.hh"
#include "fs/metadata_to_disk_buffer.hh"
@@ -176,6 +177,27 @@ void metadata_log::memory_only_update_mtime(inode_t inode, decltype(unix_metadat
}
}

+void metadata_log::memory_only_truncate(inode_t inode, file_offset_t size) {

+ auto it = _inodes.find(inode);
+ assert(it != _inodes.end());
+ assert(it->second.is_file());

+ auto& file = it->second.get_file();
+

+ auto file_size = file.size();

+ if (size > file_size) {
+ file.data.emplace(file_size, inode_data_vec {
+ {file_size, size},

+ inode_data_vec::hole_data {}
+ });
+ } else {

+ // TODO: for compaction: update used inode_data_vec

+ cut_out_data_range(file, {
+ size,
+ std::numeric_limits<decltype(file_range::end)>::max()

+ });
+ }
+}
+

void metadata_log::memory_only_add_dir_entry(inode_info::directory& dir, inode_t entry_inode, std::string entry_name) {
auto it = _inodes.find(entry_inode);
assert(it != _inodes.end());

@@ -444,6 +466,10 @@ future<size_t> metadata_log::write(inode_t inode, file_offset_t pos, const void*

return write_operation::perform(*this, inode, pos, buffer, len, pc);
}

+future<> metadata_log::truncate(inode_t inode, file_offset_t size) {
+ return truncate_operation::perform(*this, inode, size);

+}
+
// TODO: think about how to make filesystem recoverable from ENOSPACE situation: flush() (or something else) throws ENOSPACE,
// then it should be possible to compact some data (e.g. by truncating a file) via top-level interface and retrying the flush()
// without a ENOSPACE error. In particular if we delete all files after ENOSPACE it should be successful. It becomes especially
diff --git a/src/fs/metadata_log_bootstrap.cc b/src/fs/metadata_log_bootstrap.cc

index 52354181..5e3b74e4 100644
--- a/src/fs/metadata_log_bootstrap.cc
+++ b/src/fs/metadata_log_bootstrap.cc
@@ -228,6 +228,8 @@ future<> metadata_log_bootstrap::bootstrap_checkpointed_data() {
return bootstrap_large_write();
case LARGE_WRITE_WITHOUT_MTIME:
return bootstrap_large_write_without_mtime();
+ case TRUNCATE:
+ return bootstrap_truncate();

case ADD_DIR_ENTRY:
return bootstrap_add_dir_entry();
case CREATE_INODE_AS_DIR_ENTRY:

@@ -387,6 +389,21 @@ future<> metadata_log_bootstrap::bootstrap_large_write_without_mtime() {
return now();
}

+future<> metadata_log_bootstrap::bootstrap_truncate() {
+ ondisk_truncate entry;

+ if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.inode)) {
+ return invalid_entry_exception();
+ }
+
+ if (not _metadata_log._inodes[entry.inode].is_file()) {
+ return invalid_entry_exception();
+ }
+

+ _metadata_log.memory_only_truncate(entry.inode, entry.size);

+ _metadata_log.memory_only_update_mtime(entry.inode, entry.time_ns);
+ return now();
+}
+

future<> metadata_log_bootstrap::bootstrap_add_dir_entry() {
ondisk_add_dir_entry_header entry;
if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.dir_inode) or
diff --git a/CMakeLists.txt b/CMakeLists.txt

index 840a02aa..b6c8ef3a 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -681,6 +681,7 @@ if (Seastar_EXPERIMENTAL_FS)
src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
src/fs/metadata_log_operations/create_file.hh
src/fs/metadata_log_operations/link_file.hh
+ src/fs/metadata_log_operations/truncate.hh
src/fs/metadata_log_operations/unlink_or_remove_file.hh
src/fs/metadata_log_operations/write.hh
src/fs/metadata_to_disk_buffer.hh
--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:42 AM

to seastar-dev@googlegroups.com, Michał Niciejewski, sarna@scylladb.com, ankezy@gmail.com, wmitros@protonmail.com

From: Michał Niciejewski <qup...@gmail.com>

Reads file data from disk and memory based on the information stored in
the inode's data vectors. This version is not optimized: disk reads always
go into temporary buffers first and are then copied to the buffer supplied
by the caller.
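
The temporary-buffer approach can be sketched as follows. This is a simplified, synchronous illustration, not the code from the patch: `byte_range`, `round_down_pow2`, `round_up_pow2` and `read_unaligned` are hypothetical names, and a `std::vector` stands in for the block device.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical half-open byte range [beg, end).
struct byte_range {
    uint64_t beg;
    uint64_t end;
};

// Hypothetical helpers; alignment must be a power of two.
inline uint64_t round_down_pow2(uint64_t x, uint64_t alignment) {
    return x & ~(alignment - 1);
}
inline uint64_t round_up_pow2(uint64_t x, uint64_t alignment) {
    return (x + alignment - 1) & ~(alignment - 1);
}

// Reads `want` from `device` through an aligned temporary buffer and copies
// only the requested bytes into `out`. Returns the number of bytes copied,
// which may be short if the device ends before want.end.
inline size_t read_unaligned(const std::vector<uint8_t>& device, byte_range want,
        uint64_t alignment, uint8_t* out) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    // Widen the request to alignment boundaries, as a device would require.
    byte_range aligned{round_down_pow2(want.beg, alignment), round_up_pow2(want.end, alignment)};
    // Model a short read: the device may end before the aligned range does.
    aligned.end = std::min<uint64_t>(aligned.end, device.size());
    if (aligned.beg >= aligned.end) {
        return 0;
    }

    // The "disk read" into the temporary buffer.
    std::vector<uint8_t> tmp(device.begin() + aligned.beg, device.begin() + aligned.end);

    // Copy the intersection of the requested range and what was actually read.
    const uint64_t copy_end = std::min(want.end, aligned.end);
    if (want.beg >= copy_end) {
        return 0;
    }
    const size_t len = copy_end - want.beg;
    std::memcpy(out, tmp.data() + (want.beg - aligned.beg), len);
    return len;
}
```

The extra copy is the cost the commit message refers to; a read that is already aligned could in principle go straight into the caller's buffer.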

Signed-off-by: Michał Niciejewski <qup...@gmail.com>
---

src/fs/metadata_log.hh | 5 +
src/fs/metadata_log_operations/read.hh | 189 +++++++++++++++++++++++++
src/fs/metadata_log.cc | 6 +
CMakeLists.txt | 1 +
4 files changed, 201 insertions(+)
create mode 100644 src/fs/metadata_log_operations/read.hh

diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
index 1ee29842..4ba2658a 100644

--- a/src/fs/metadata_log.hh
+++ b/src/fs/metadata_log.hh
@@ -161,6 +161,7 @@ class metadata_log {
friend class create_and_open_unlinked_file_operation;
friend class create_file_operation;
friend class link_file_operation;

+ friend class read_operation;

friend class truncate_operation;
friend class unlink_or_remove_file_operation;
friend class write_operation;

@@ -337,6 +338,10 @@ class metadata_log {

future<> close_file(inode_t inode);

+ // Unaligned reads and writes are supported but discouraged because they hurt performance
+ future<size_t> read(inode_t inode, file_offset_t pos, void* buffer, size_t len,
+ const io_priority_class& pc = default_priority_class());

+
future<size_t> write(inode_t inode, file_offset_t pos, const void* buffer, size_t len,
const io_priority_class& pc = default_priority_class());

diff --git a/src/fs/metadata_log_operations/read.hh b/src/fs/metadata_log_operations/read.hh
new file mode 100644
index 00000000..33d3545a
--- /dev/null
+++ b/src/fs/metadata_log_operations/read.hh
@@ -0,0 +1,189 @@

+#include "fs/range.hh"

+#include "fs/units.hh"
+#include "seastar/core/future-util.hh"
+#include "seastar/core/future.hh"

+#include "seastar/core/temporary_buffer.hh"

+
+namespace seastar::fs {
+

+class read_operation {

+ metadata_log& _metadata_log;
+ inode_t _inode;

+ const io_priority_class& _pc;
+

+ read_operation(metadata_log& metadata_log, inode_t inode, const io_priority_class& pc)

+ : _metadata_log(metadata_log), _inode(inode), _pc(pc) {}
+

+ future<size_t> read(uint8_t* buffer, size_t read_len, file_offset_t file_offset) {

+ auto inode_it = _metadata_log._inodes.find(_inode);
+ if (inode_it == _metadata_log._inodes.end()) {

+ return make_exception_future<size_t>(invalid_inode_exception());

+ }
+ if (inode_it->second.is_directory()) {

+ return make_exception_future<size_t>(is_directory_exception());
+ }

+ inode_info::file* file_info = &inode_it->second.get_file();

+
+ return _metadata_log._locks.with_lock(metadata_log::locks::shared {_inode},

+ [this, file_info, buffer, read_len, file_offset] {
+ // TODO: do we want to keep that lock during reading? Everything should work even after file removal
+ if (not _metadata_log.inode_exists(_inode)) {

+ return make_exception_future<size_t>(operation_became_invalid_exception());
+ }
+

+ // TODO: we can change it to deque to pop from data_vecs instead of iterating
+ std::vector<inode_data_vec> data_vecs;
+ // Extract data vectors from file_info
+ file_info->execute_on_data_range({file_offset, file_offset + read_len}, [&data_vecs](inode_data_vec data_vec) {
+ // TODO: for compaction: mark that clusters shouldn't be moved to _cluster_allocator before that read ends
+ data_vecs.emplace_back(std::move(data_vec));
+ });
+
+ return iterate_reads(buffer, file_offset, std::move(data_vecs));
+ });
+ }
+
+ future<size_t> iterate_reads(uint8_t* buffer, file_offset_t file_offset, std::vector<inode_data_vec> data_vecs) {
+ return do_with(std::move(data_vecs), (size_t)0, (size_t)0,
+ [this, buffer, file_offset](std::vector<inode_data_vec>& data_vecs, size_t& vec_idx, size_t& completed_read_len) {
+ return repeat([this, &completed_read_len, &data_vecs, &vec_idx, buffer, file_offset] {
+ if (vec_idx == data_vecs.size()) {

+ return make_ready_future<bool_class<stop_iteration_tag>>(stop_iteration::yes);
+ }
+

+ inode_data_vec& data_vec = data_vecs[vec_idx++];
+ size_t expected_read_len = data_vec.data_range.size();
+
+ return do_read(data_vec, buffer + data_vec.data_range.beg - file_offset).then(
+ [&completed_read_len, expected_read_len](size_t read_len) {
+ completed_read_len += read_len;
+ if (read_len != expected_read_len) {

+ return stop_iteration::yes;
+ }
+ return stop_iteration::no;
+ });

+ }).then([&completed_read_len] {
+ return make_ready_future<size_t>(completed_read_len);
+ });
+ });
+ }
+
+ struct disk_temp_buffer {
+ disk_range _disk_range;
+ temporary_buffer<uint8_t> _data;
+ };
+ // Keep last disk read to accelerate next disk reads in cases where next read is intersecting previous disk read
+ // (after alignment)
+ disk_temp_buffer _prev_disk_read;
+
+ future<size_t> do_read(inode_data_vec& data_vec, uint8_t* buffer) {
+ size_t expected_read_len = data_vec.data_range.size();

+
+ return std::visit(overloaded {

+ [&](inode_data_vec::in_mem_data& mem) {

+ std::memcpy(buffer, mem.data.get(), expected_read_len);
+ return make_ready_future<size_t>(expected_read_len);

+ },
+ [&](inode_data_vec::hole_data&) {

+ std::memset(buffer, 0, expected_read_len);
+ return make_ready_future<size_t>(expected_read_len);

+ },
+ [&](inode_data_vec::on_disk_data& disk_data) {

+ // TODO: we can optimize the case when disk_data.device_offset is aligned
+
+ // Copies data from source_buffer corresponding to the intersection of dest_disk_range
+ // and source_buffer._disk_range into buffer. dest_disk_range.beg corresponds to the first byte of buffer
+ // Works when dest_disk_range.beg >= source_buffer._disk_range.beg
+ auto copy_left_intersecting_data =
+ [](uint8_t* buffer, disk_range dest_disk_range, const disk_temp_buffer& source_buffer) -> size_t {
+ disk_range intersect = intersection(dest_disk_range, source_buffer._disk_range);
+
+ assert((intersect.is_empty() or dest_disk_range.beg >= source_buffer._disk_range.beg) and
+ "Beginning of source buffer on disk should be before beginning of destination buffer on disk");
+
+ if (not intersect.is_empty()) {
+ // We can copy _data from disk_temp_buffer
+ disk_offset_t common_data_len = intersect.size();
+ disk_offset_t source_data_offset = dest_disk_range.beg - source_buffer._disk_range.beg;
+ // TODO: maybe we should split that memcpy to multiple parts because large reads can lead
+ // to spikes in latency
+ std::memcpy(buffer, source_buffer._data.get() + source_data_offset, common_data_len);
+ return common_data_len;
+ } else {
+ return 0;
+ }
+ };
+
+ disk_range remaining_read_range {
+ disk_data.device_offset,
+ disk_data.device_offset + expected_read_len
+ };
+
+ size_t current_read_len = 0;
+ if (not _prev_disk_read._data.empty()) {
+ current_read_len = copy_left_intersecting_data(buffer, remaining_read_range, _prev_disk_read);
+ if (current_read_len == expected_read_len) {
+ return make_ready_future<size_t>(expected_read_len);
+ }
+ remaining_read_range.beg += current_read_len;
+ buffer += current_read_len;
+ }
+
+ disk_temp_buffer new_disk_read;
+ new_disk_read._disk_range = {
+ round_down_to_multiple_of_power_of_2(remaining_read_range.beg, _metadata_log._alignment),
+ round_up_to_multiple_of_power_of_2(remaining_read_range.end, _metadata_log._alignment)
+ };
+ new_disk_read._data = temporary_buffer<uint8_t>::aligned(_metadata_log._alignment, new_disk_read._disk_range.size());
+
+ return _metadata_log._device.read(new_disk_read._disk_range.beg, new_disk_read._data.get_write(),
+ new_disk_read._disk_range.size(), _pc).then(
+ [this, copy_left_intersecting_data = std::move(copy_left_intersecting_data),
+ new_disk_read = std::move(new_disk_read),
+ remaining_read_range, buffer, current_read_len](size_t read_len) mutable {
+ new_disk_read._disk_range.end = new_disk_read._disk_range.beg + read_len;
+ current_read_len += copy_left_intersecting_data(buffer, remaining_read_range, new_disk_read);
+ _prev_disk_read = std::move(new_disk_read);
+ return current_read_len;
+ });

+ },
+ }, data_vec.data_location);

+ }
+
+public:
+ static future<size_t> perform(metadata_log& metadata_log, inode_t inode, file_offset_t pos, void* buffer,

+ size_t len, const io_priority_class& pc) {

+ return do_with(read_operation(metadata_log, inode, pc), [pos, buffer, len](auto& obj) {
+ return obj.read(static_cast<uint8_t*>(buffer), len, pos);

+ });
+ }
+};
+
+} // namespace seastar::fs

diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc
index 18f52dfc..6a3e5c07 100644

--- a/src/fs/metadata_log.cc
+++ b/src/fs/metadata_log.cc
@@ -29,6 +29,7 @@
#include "fs/metadata_log_operations/create_and_open_unlinked_file.hh"
#include "fs/metadata_log_operations/create_file.hh"
#include "fs/metadata_log_operations/link_file.hh"

+#include "fs/metadata_log_operations/read.hh"

#include "fs/metadata_log_operations/truncate.hh"
#include "fs/metadata_log_operations/unlink_or_remove_file.hh"
#include "fs/metadata_log_operations/write.hh"

@@ -461,6 +462,11 @@ future<> metadata_log::close_file(inode_t inode) {
});
}

+future<size_t> metadata_log::read(inode_t inode, file_offset_t pos, void* buffer, size_t len,
+ const io_priority_class& pc) {
+ return read_operation::perform(*this, inode, pos, buffer, len, pc);
+}
+
future<size_t> metadata_log::write(inode_t inode, file_offset_t pos, const void* buffer, size_t len,
const io_priority_class& pc) {

return write_operation::perform(*this, inode, pos, buffer, len, pc);

diff --git a/CMakeLists.txt b/CMakeLists.txt
index b6c8ef3a..e4205031 100644

--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -681,6 +681,7 @@ if (Seastar_EXPERIMENTAL_FS)
src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
src/fs/metadata_log_operations/create_file.hh
src/fs/metadata_log_operations/link_file.hh

+ src/fs/metadata_log_operations/read.hh
src/fs/metadata_log_operations/truncate.hh
src/fs/metadata_log_operations/unlink_or_remove_file.hh
src/fs/metadata_log_operations/write.hh
--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:43 AM

to seastar-dev@googlegroups.com, Krzysztof Małysa, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

Provides an interface to query file attributes, including permissions,
btime, mtime and ctime.
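
Since the timestamps are kept as nanosecond counts and exposed as `std::chrono::system_clock::time_point` values, the conversion can be sketched as below. The helper names `ns_to_time_point` and `time_point_to_ns` are hypothetical and not part of the patch. The sketch uses `duration_cast` on the assumption that `system_clock::duration` may be coarser than nanoseconds on some standard libraries, in which case an implicit conversion would not compile.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: nanoseconds since the epoch -> system_clock time_point.
// duration_cast truncates if the clock's resolution is coarser than 1 ns.
inline std::chrono::system_clock::time_point ns_to_time_point(uint64_t ns_since_epoch) {
    using namespace std::chrono;
    return system_clock::time_point(
            duration_cast<system_clock::duration>(nanoseconds(ns_since_epoch)));
}

// Hypothetical helper: system_clock time_point -> nanoseconds since the epoch.
inline uint64_t time_point_to_ns(std::chrono::system_clock::time_point tp) {
    using namespace std::chrono;
    return static_cast<uint64_t>(duration_cast<nanoseconds>(tp.time_since_epoch()).count());
}
```

A caller could then compare, for example, `time_modified` and `time_changed` directly as time points instead of raw integers.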

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---

include/seastar/fs/stat.hh | 41 +++++++++++++++++++++++++++++++++++
src/fs/metadata_log.hh | 5 +++++
src/fs/metadata_log.cc | 44 ++++++++++++++++++++++++++++++++++++--
CMakeLists.txt | 1 +
4 files changed, 89 insertions(+), 2 deletions(-)
create mode 100644 include/seastar/fs/stat.hh

diff --git a/include/seastar/fs/stat.hh b/include/seastar/fs/stat.hh
new file mode 100644
index 00000000..3ccbf7bc
--- /dev/null
+++ b/include/seastar/fs/stat.hh
@@ -0,0 +1,41 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*

+ */
+
+#pragma once
+

+#include "seastar/core/file-types.hh"
+
+#include <chrono>
+#include <sys/types.h>

+
+namespace seastar::fs {
+

+struct stat_data {
+ directory_entry_type type;

+ file_permissions perms;
+ uid_t uid;
+ gid_t gid;

+ std::chrono::system_clock::time_point time_born; // Time of creation
+ std::chrono::system_clock::time_point time_modified; // Time of last content modification
+ std::chrono::system_clock::time_point time_changed; // Time of last status change (either content or attributes)

+};
+
+} // namespace seastar::fs

diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
index 4ba2658a..08f8e9aa 100644
--- a/src/fs/metadata_log.hh
+++ b/src/fs/metadata_log.hh
@@ -38,6 +38,7 @@
#include "seastar/core/shared_ptr.hh"
#include "seastar/core/temporary_buffer.hh"
#include "seastar/fs/exceptions.hh"
+#include "seastar/fs/stat.hh"

#include <chrono>
#include <cstddef>
@@ -309,6 +310,10 @@ class metadata_log {
});
}

+ stat_data stat(inode_t inode) const;
+
+ stat_data stat(const std::string& path) const;
+

// Returns size of the file or throws exception iff @p inode is invalid
file_offset_t file_size(inode_t inode) const;

diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc
index 6a3e5c07..8a9dfe98 100644
--- a/src/fs/metadata_log.cc
+++ b/src/fs/metadata_log.cc
@@ -43,7 +43,9 @@
#include "seastar/core/future-util.hh"
#include "seastar/core/future.hh"
#include "seastar/core/shared_mutex.hh"
+#include "seastar/fs/exceptions.hh"
#include "seastar/fs/overloaded.hh"
+#include "seastar/fs/stat.hh"

#include <boost/crc.hpp>
#include <boost/range/irange.hpp>
@@ -340,7 +342,6 @@ std::variant<inode_t, metadata_log::path_lookup_error> metadata_log::do_path_loo

}

future<inode_t> metadata_log::path_lookup(const std::string& path) const {

- auto lookup_res = do_path_lookup(path);
return std::visit(overloaded {
[](path_lookup_error error) {
switch (error) {
@@ -356,7 +357,7 @@ future<inode_t> metadata_log::path_lookup(const std::string& path) const {
[](inode_t inode) {
return make_ready_future<inode_t>(inode);
}
- }, lookup_res);
+ }, do_path_lookup(path));

}

file_offset_t metadata_log::file_size(inode_t inode) const {

@@ -375,6 +376,45 @@ file_offset_t metadata_log::file_size(inode_t inode) const {
}, it->second.contents);
}

+stat_data metadata_log::stat(inode_t inode) const {

+ auto it = _inodes.find(inode);

+ if (it == _inodes.end())
+ throw invalid_inode_exception();
+

+ const inode_info& inode_info = it->second;
+ return {
+ std::visit(overloaded {
+ [](const inode_info::file&) { return directory_entry_type::regular; },
+ [](const inode_info::directory&) { return directory_entry_type::directory; },
+ }, inode_info.contents),
+ inode_info.metadata.perms,
+ inode_info.metadata.uid,
+ inode_info.metadata.gid,
+ std::chrono::system_clock::time_point(std::chrono::nanoseconds(inode_info.metadata.btime_ns)),
+ std::chrono::system_clock::time_point(std::chrono::nanoseconds(inode_info.metadata.mtime_ns)),
+ std::chrono::system_clock::time_point(std::chrono::nanoseconds(inode_info.metadata.ctime_ns)),
+ };
+}
+
+stat_data metadata_log::stat(const std::string& path) const {
+ return std::visit(overloaded {
+ [](path_lookup_error error) -> stat_data {

+ switch (error) {
+ case path_lookup_error::NOT_ABSOLUTE:

+ throw path_is_not_absolute_exception();
+ case path_lookup_error::NO_ENTRY:
+ throw no_such_file_or_directory_exception();
+ case path_lookup_error::NOT_DIR:
+ throw path_component_not_directory_exception();

+ }
+ __builtin_unreachable();
+ },

+ [this](inode_t inode) {
+ return stat(inode);
+ }
+ }, do_path_lookup(path));
+}
+

future<> metadata_log::create_file(std::string path, file_permissions perms) {

return create_file_operation::perform(*this, std::move(path), std::move(perms), create_semantics::CREATE_FILE).discard_result();
}
diff --git a/CMakeLists.txt b/CMakeLists.txt
index e4205031..e4167018 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -661,6 +661,7 @@ if (Seastar_EXPERIMENTAL_FS)
include/seastar/fs/exceptions.hh
include/seastar/fs/file.hh
include/seastar/fs/overloaded.hh
+ include/seastar/fs/stat.hh

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:44 AM

to seastar-dev@googlegroups.com, Wojciech Mitros, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com

From: Wojciech Mitros <wmi...@protonmail.com>

The test checks, on small examples, that the data a to_disk_buffer writes
to disk matches the data appended to it and that the remaining buffer
space is calculated correctly.
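
The space accounting that the assertions in this test rely on can be modelled roughly as below. This is an assumption inferred from the test's expectations, not the actual to_disk_buffer implementation: `buffer_accounting` and `round_up_pow2` are hypothetical names, and the model tracks only byte counts, not data.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper; alignment must be a power of two.
inline uint64_t round_up_pow2(uint64_t x, uint64_t alignment) {
    return (x + alignment - 1) & ~(alignment - 1);
}

// Hypothetical model of the remaining-space bookkeeping:
// appends consume exactly their length, and a flush pads the used
// space up to the next alignment boundary.
struct buffer_accounting {
    uint64_t capacity;
    uint64_t alignment;
    uint64_t used = 0;

    void append(uint64_t len) {
        assert(len <= bytes_left());
        used += len;
    }

    void flush() {
        used = round_up_pow2(used, alignment);
    }

    uint64_t bytes_left() const {
        return capacity - used;
    }
};
```

Under this model, flushing an empty buffer leaves `bytes_left()` unchanged, and appending 6 and then 9 bytes followed by a flush with 4 KB alignment consumes one full 4 KB unit.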

Signed-off-by: Wojciech Mitros <wmi...@protonmail.com>
---

tests/unit/fs_to_disk_buffer_test.cc | 160 +++++++++++++++++++++++++++
tests/unit/CMakeLists.txt | 4 +
2 files changed, 164 insertions(+)
create mode 100644 tests/unit/fs_to_disk_buffer_test.cc

diff --git a/tests/unit/fs_to_disk_buffer_test.cc b/tests/unit/fs_to_disk_buffer_test.cc
new file mode 100644
index 00000000..36b0274c
--- /dev/null
+++ b/tests/unit/fs_to_disk_buffer_test.cc
@@ -0,0 +1,160 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*

+ * Copyright (C) 2020 ScyllaDB
+ */
+

+#include "fs/bitwise.hh"
+#include "fs/to_disk_buffer.hh"
+#include "fs/units.hh"
+#include "fs_mock_block_device.hh"
+
+#include <cstring>
+#include <seastar/core/units.hh>

+#include <seastar/fs/block_device.hh>
+#include <seastar/testing/test_case.hh>
+#include <seastar/testing/test_runner.hh>
+#include <seastar/testing/thread_test_case.hh>
+
+using namespace seastar;

+using namespace seastar::fs;
+

+constexpr unit_size_t alignment = 4*KB;
+constexpr unit_size_t max_siz = 32*MB; // reasonably larger than alignment
+
+BOOST_TEST_DONT_PRINT_LOG_VALUE(to_disk_buffer)
+
+SEASTAR_THREAD_TEST_CASE(test_initially_empty) {

+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);

+ auto disk_buf = to_disk_buffer();
+ disk_buf.init(max_siz, alignment, 0);
+ BOOST_REQUIRE_EQUAL(disk_buf.bytes_left(), max_siz);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_simple_write) {

+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);

+ auto disk_buf = to_disk_buffer();
+ disk_buf.init(max_siz, alignment, 0);
+ auto buf = temporary_buffer<char>::aligned(alignment, max_siz);
+ auto expected_buf = temporary_buffer<char>::aligned(alignment, max_siz);
+ disk_offset_t len_aligned;
+ disk_buf.append_bytes("12345678", 6);
+ BOOST_REQUIRE_EQUAL(disk_buf.bytes_left(), max_siz-6);
+ disk_buf.append_bytes("abcdefghi", 9);
+ BOOST_REQUIRE_EQUAL(disk_buf.bytes_left(), max_siz-15);
+ disk_buf.flush_to_disk(dev).get();
+ len_aligned = round_up_to_multiple_of_power_of_2((disk_offset_t)15, alignment);
+ BOOST_REQUIRE_EQUAL(disk_buf.bytes_left(), max_siz-len_aligned);
+ dev.read<char>(0, buf.get_write(), len_aligned).get();
+ strncpy(expected_buf.get_write(), "123456abcdefghi", max_siz); // fills with null characters
+ BOOST_REQUIRE_EQUAL(std::memcmp(expected_buf.get(), buf.get(), len_aligned), 0);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_multiple_write) {

+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);

+ auto disk_buf = to_disk_buffer();
+ disk_buf.init(max_siz, alignment, 0);
+ auto buf = temporary_buffer<char>::aligned(alignment, max_siz);
+ auto expected_buf = temporary_buffer<char>::aligned(alignment, max_siz);
+ disk_offset_t len_aligned, len_aligned2;
+ disk_buf.append_bytes("9876", 4);
+ disk_buf.append_bytes("zyxwvutsr", 9);
+ disk_buf.flush_to_disk(dev).get();
+ len_aligned = round_up_to_multiple_of_power_of_2((disk_offset_t)13, alignment);
+ dev.read<char>(0, buf.get_write(), len_aligned).get();
+ strncpy(expected_buf.get_write(), "9876zyxwvutsr", max_siz);
+ BOOST_REQUIRE_EQUAL(std::memcmp(expected_buf.get(), buf.get(), len_aligned), 0);
+
+ disk_buf.append_bytes("12345678", 6);
+ BOOST_REQUIRE_EQUAL(disk_buf.bytes_left(), max_siz-len_aligned-6);
+ disk_buf.append_bytes("abcdefghi", 9);
+ BOOST_REQUIRE_EQUAL(disk_buf.bytes_left(), max_siz-len_aligned-15);
+ disk_buf.flush_to_disk(dev).get();
+ len_aligned2 = round_up_to_multiple_of_power_of_2(len_aligned+15, alignment);
+ strncpy(expected_buf.get_write()+len_aligned, "123456abcdefghi", len_aligned2-len_aligned);
+ BOOST_REQUIRE_EQUAL(disk_buf.bytes_left(), max_siz-len_aligned2);
+ dev.read<char>(0, buf.get_write(), len_aligned2).get();
+ BOOST_REQUIRE_EQUAL(std::memcmp(expected_buf.get(), buf.get(), len_aligned2), 0);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_empty_write) {

+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);

+ auto disk_buf = to_disk_buffer();
+ disk_buf.init(max_siz, alignment, 0);
+ disk_buf.flush_to_disk(dev).get();
+ BOOST_REQUIRE_EQUAL(disk_buf.bytes_left(), max_siz);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_empty_append_bytes) {

+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);

+ auto disk_buf = to_disk_buffer();
+ disk_buf.init(max_siz, alignment, 0);
+ disk_buf.append_bytes("123456", 0);
+ BOOST_REQUIRE_EQUAL(disk_buf.bytes_left(), max_siz);
+ disk_buf.flush_to_disk(dev).get();
+ BOOST_REQUIRE_EQUAL(disk_buf.bytes_left(), max_siz);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_combined) {

+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);

+ auto disk_buf = to_disk_buffer();
+ disk_buf.init(max_siz, alignment, 0);
+ auto buf = temporary_buffer<char>::aligned(alignment, max_siz);
+ auto inp = temporary_buffer<char>::aligned(alignment, max_siz);
+ auto expected_buf = temporary_buffer<char>::aligned(alignment, max_siz);
+ strncpy(inp.get_write(), "abcdefghij12345678**987654****", max_siz); // fills to max_siz with null characters
+ disk_offset_t beg = 0, end = 0;
+ BOOST_REQUIRE_EQUAL(disk_buf.bytes_left(), max_siz - end);
+
+ disk_buf.append_bytes(inp.get(), 10);
+ end += 10;
+ BOOST_REQUIRE_EQUAL(disk_buf.bytes_left(), max_siz - end);
+
+ disk_buf.flush_to_disk(dev).get();
+ end = round_up_to_multiple_of_power_of_2(end, alignment);
+ BOOST_REQUIRE_EQUAL(disk_buf.bytes_left(), max_siz - end);
+
+ dev.read<char>(beg, buf.get_write(), end-beg).get();
+ strncpy(expected_buf.get_write(), "abcdefghij", end-beg);
+ BOOST_REQUIRE_EQUAL(std::memcmp(expected_buf.get(), buf.get(), end-beg), 0);
+
+ beg = end;
+ disk_buf.append_bytes(inp.get()+10, 8);
+ end += 8;
+ BOOST_REQUIRE_EQUAL(disk_buf.bytes_left(), max_siz - end);
+
+ disk_buf.append_bytes(inp.get()+20, 6);
+ end += 6;
+ BOOST_REQUIRE_EQUAL(disk_buf.bytes_left(), max_siz - end);
+
+ disk_buf.flush_to_disk(dev).get();
+ end = round_up_to_multiple_of_power_of_2(end, alignment);
+ BOOST_REQUIRE_EQUAL(disk_buf.bytes_left(), max_siz - end);
+
+ dev.read<char>(beg, buf.get_write(), end-beg).get();
+ std::memset(expected_buf.get_write(), 0, end-beg);
+ strncpy(expected_buf.get_write(), inp.get()+10, 8);
+ strncpy(expected_buf.get_write()+8, inp.get()+20, 6);
+ BOOST_REQUIRE_EQUAL(std::memcmp(expected_buf.get(), buf.get(), end-beg), 0);
+}
diff --git a/tests/unit/CMakeLists.txt b/tests/unit/CMakeLists.txt
index 07551b0b..520546bc 100644
--- a/tests/unit/CMakeLists.txt
+++ b/tests/unit/CMakeLists.txt
@@ -377,6 +377,10 @@ if (Seastar_EXPERIMENTAL_FS)
SOURCES fs_path_test.cc)
seastar_add_test (fs_seastarfs
SOURCES fs_seastarfs_test.cc)
+ seastar_add_test (fs_to_disk_buffer
+ SOURCES
+ fs_to_disk_buffer_test.cc
+ fs_mock_block_device.cc)
endif()

seastar_add_test (semaphore
--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:45 AM

to seastar-dev@googlegroups.com, Michał Niciejewski, sarna@scylladb.com, ankezy@gmail.com, wmitros@protonmail.com

From: Michał Niciejewski <qup...@gmail.com>

Add mocks for metadata_to_disk_buffer and cluster_writer:
- each mock records every operation performed on it
- each mock class keeps a list of the instances created through
  virtual_constructor()

Add tests for the metadata_to_disk_buffer mock. The tests check that
the mock behaves like metadata_to_disk_buffer.

Signed-off-by: Michał Niciejewski <qup...@gmail.com>
---
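Note for reviewers: the two properties listed in the commit message
(recording every operation, and keeping a list of virtually created
instances) can be sketched in isolation as below. This is only a minimal
illustration under assumed names: buffer_iface, recording_buffer and
write_two_entries are hypothetical and are not part of this patch. The
real mocks derive from metadata_to_disk_buffer and cluster_writer.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the production interface (illustration only).
struct buffer_iface {
    virtual ~buffer_iface() = default;
    virtual std::shared_ptr<buffer_iface> virtual_constructor() const = 0;
    virtual void append(const std::string& entry) = 0;
    virtual void flush() = 0;
};

struct recording_buffer : buffer_iface {
    // Every instance produced by virtual_constructor() is registered here, so a
    // test can reach buffers that the code under test created internally.
    inline static thread_local std::vector<std::shared_ptr<recording_buffer>> created;

    // Ordered log of the operations performed on this instance.
    std::vector<std::string> actions;

    std::shared_ptr<buffer_iface> virtual_constructor() const override {
        auto fresh = std::make_shared<recording_buffer>();
        created.push_back(fresh);
        return fresh;
    }

    void append(const std::string& entry) override { actions.push_back("append:" + entry); }
    void flush() override { actions.push_back("flush"); }
};

// Hypothetical code under test: it only sees the interface and works on a
// fresh buffer obtained from the prototype it was given.
inline void write_two_entries(const buffer_iface& prototype) {
    auto buf = prototype.virtual_constructor();
    buf->append("create_inode");
    buf->append("small_write");
    buf->flush();
}
```

A test would pass a recording_buffer as the prototype and then inspect
recording_buffer::created.back()->actions. The prototype's own log stays
empty, which is why the helpers in this patch look at the last virtually
constructed instance instead.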

tests/unit/fs_metadata_common.hh | 467 ++++++++++++++++++
tests/unit/fs_mock_cluster_writer.hh | 78 +++
tests/unit/fs_mock_metadata_to_disk_buffer.hh | 323 ++++++++++++
.../fs_mock_metadata_to_disk_buffer_test.cc | 357 +++++++++++++
tests/unit/CMakeLists.txt | 4 +
5 files changed, 1229 insertions(+)
create mode 100644 tests/unit/fs_metadata_common.hh
create mode 100644 tests/unit/fs_mock_cluster_writer.hh
create mode 100644 tests/unit/fs_mock_metadata_to_disk_buffer.hh
create mode 100644 tests/unit/fs_mock_metadata_to_disk_buffer_test.cc
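
Note for reviewers: recorded actions are stored as a variant nested in a
variant, and the helpers is_append_type / get_by_append_type plus the
operator<< overloads all go through that nesting. The sketch below shows
the same idea on a reduced example. The entry types create_inode_entry
and truncate_entry and the describe() function are hypothetical
simplifications, and the local overloaded helper is the usual C++17 idiom
written out here rather than the one from seastar/fs/overloaded.hh.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <variant>

// Usual C++17 helper for building one visitor out of several lambdas.
template<class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
template<class... Ts> overloaded(Ts...) -> overloaded<Ts...>;

// Hypothetical, simplified entries (the real ones are the ondisk_* structs).
struct create_inode_entry { uint64_t inode; };
struct truncate_entry { uint64_t inode; uint64_t size; };

struct action {
    struct append { std::variant<create_inode_entry, truncate_entry> entry; };
    struct flush_to_disk {};
    std::variant<append, flush_to_disk> data;
};

// True only if the action is an append AND the appended entry has type T.
template<typename T>
bool is_append_type(const action& a) {
    const auto* app = std::get_if<action::append>(&a.data);
    return app != nullptr && std::holds_alternative<T>(app->entry);
}

// Throws std::bad_variant_access if the action does not hold an appended T.
template<typename T>
const T& get_by_append_type(const action& a) {
    return std::get<T>(std::get<action::append>(a.data).entry);
}

// Nested visitation: the outer visitor picks append vs. flush, the inner one
// picks the concrete entry type.
inline std::string describe(const action& a) {
    std::ostringstream os;
    std::visit(overloaded {
        [&os](const action::append& app) {
            std::visit(overloaded {
                [&os](const create_inode_entry& e) {
                    os << "[append:create_inode={inode=" << e.inode << "}]";
                },
                [&os](const truncate_entry& e) {
                    os << "[append:truncate={inode=" << e.inode << ", size=" << e.size << "}]";
                }
            }, app.entry);
        },
        [&os](const action::flush_to_disk&) { os << "[flush]"; }
    }, a.data);
    return os.str();
}
```

Checking is_append_type<T> before calling get_by_append_type<T> mirrors
what check_metadata_entries_equal does with BOOST_REQUIRE, so a wrong
entry type fails the test cleanly instead of throwing from std::get.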

diff --git a/tests/unit/fs_metadata_common.hh b/tests/unit/fs_metadata_common.hh
new file mode 100644
index 00000000..dfa6186d
--- /dev/null
+++ b/tests/unit/fs_metadata_common.hh
@@ -0,0 +1,467 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/bitwise.hh"
+#include "fs/cluster.hh"
+#include "fs/metadata_disk_entries.hh"
+#include "fs/metadata_log.hh"
+#include "fs/metadata_log_operations/write.hh"
+#include "fs/units.hh"
+#include "fs_mock_block_device.hh"
+#include "fs_mock_cluster_writer.hh"
+#include "fs_mock_metadata_to_disk_buffer.hh"
+
+#include <seastar/core/shared_ptr.hh>
+#include <seastar/core/temporary_buffer.hh>
+#include <seastar/testing/test_case.hh>
+#include <seastar/testing/test_runner.hh>
+
+#include <assert.h>
+#include <cctype>
+#include <chrono>
+#include <cstdint>
+#include <iostream>
+#include <random>
+#include <string>
+#include <tuple>
+#include <utility>
+#include <vector>
+
+#define CHECK_CALL(...) BOOST_TEST_CONTEXT("Called from line: " << __LINE__) { __VA_ARGS__; }
+
+// Returns a copy of the given value, so that possibly misaligned fields of packed structs are never bound to a reference.
+template<typename T>
+inline T copy_value(T x) {
+ return x;
+}
+
+namespace seastar {
+
+inline std::ostream& operator<<(std::ostream& os, const temporary_buffer<uint8_t>& data) {
+ constexpr size_t MAX_ELEMENTS_PRINTED = 10;
+ for (size_t i = 0; i < std::min(MAX_ELEMENTS_PRINTED, data.size()); ++i) {
+ if (isprint(data[i]) != 0) {
+ os << data[i];
+ } else {
+ os << '<' << (int)data[i] << '>';
+ }
+ }
+ if (data.size() > MAX_ELEMENTS_PRINTED) {
+ os << "...(size=" << data.size() << ')';
+ }
+ return os;
+}
+
+} // seastar
+
+namespace seastar::fs {
+
+template<typename ActionType>
+const auto is_append_type = mock_metadata_to_disk_buffer::is_append_type<ActionType>;
+template<typename ActionType>
+const auto is_type = mock_metadata_to_disk_buffer::is_type<ActionType>;
+template<typename ActionType>
+const auto get_by_append_type = mock_metadata_to_disk_buffer::get_by_append_type<ActionType>;
+template<typename ActionType>
+const auto get_by_type = mock_metadata_to_disk_buffer::get_by_type<ActionType>;
+
+using flush_to_disk_action = mock_metadata_to_disk_buffer::action::flush_to_disk;
+using append_action = mock_metadata_to_disk_buffer::action::append;
+
+constexpr auto SMALL_WRITE_THRESHOLD = write_operation::SMALL_WRITE_THRESHOLD;
+
+using medium_write_len_t = decltype(ondisk_medium_write::length);
+using small_write_len_t = decltype(ondisk_small_write_header::length);
+using unix_time_t = decltype(unix_metadata::mtime_ns);
+
+template<typename BlockDevice = mock_block_device_impl,
+ typename MetadataToDiskBuffer = mock_metadata_to_disk_buffer,
+ typename ClusterWriter = mock_cluster_writer>
+inline auto init_metadata_log(unit_size_t cluster_size, unit_size_t alignment, cluster_id_t metadata_log_cluster,
+ cluster_range cluster_range) {
+ auto dev_impl = make_shared<BlockDevice>();
+ metadata_log log(block_device(dev_impl), cluster_size, alignment,
+ make_shared<MetadataToDiskBuffer>(), make_shared<ClusterWriter>());
+ log.bootstrap(0, metadata_log_cluster, cluster_range, 1, 0).get();
+
+ return std::pair{std::move(dev_impl), std::move(log)};
+}
+
+inline auto& get_current_metadata_buffer() {
+ auto& created_buffers = mock_metadata_to_disk_buffer::virtually_constructed_buffers;
+ assert(created_buffers.size() > 0);
+ return created_buffers.back();
+}
+
+inline auto& get_current_cluster_writer() {
+ auto& created_writers = mock_cluster_writer::virtually_constructed_writers;
+ assert(created_writers.size() > 0);
+ return created_writers.back();
+}
+
+inline inode_t create_and_open_file(metadata_log& log, std::string name = "tmp") {
+ log.create_file("/" + name, file_permissions::default_dir_permissions).get0();
+ return log.open_file("/" + name).get0();
+}
+
+inline temporary_buffer<uint8_t> gen_buffer(size_t len, bool aligned, unit_size_t alignment) {
+ std::default_random_engine random_engine(testing::local_random_engine());
+ temporary_buffer<uint8_t> buff;
+ if (not aligned) {
+ // Make buff unaligned
+ buff = temporary_buffer<uint8_t>::aligned(alignment, round_up_to_multiple_of_power_of_2(len, alignment) + alignment);
+ size_t offset = std::uniform_int_distribution<>(1, alignment - 1)(random_engine);
+ buff.trim_front(offset);
+ } else {
+ buff = temporary_buffer<uint8_t>::aligned(alignment, round_up_to_multiple_of_power_of_2(len, alignment));
+ }
+ buff.trim(len);
+
+ for (auto [i, char_dis] = std::tuple((size_t)0, std::uniform_int_distribution<>(0, 255)); i < buff.size(); i++) {
+ buff.get_write()[i] = char_dis(random_engine);
+ }
+
+ return buff;
+}
+
+inline void aligned_write(metadata_log& log, inode_t inode, size_t bytes_num, unit_size_t alignment) {
+ assert(bytes_num % alignment == 0);
+ temporary_buffer<uint8_t> buff = gen_buffer(bytes_num, true, alignment);
+ size_t wrote = log.write(inode, 0, buff.get(), buff.size()).get0();
+ BOOST_REQUIRE_EQUAL(wrote, buff.size());
+}
+
+inline unix_time_t get_current_time_ns() {
+ using namespace std::chrono;
+ return duration_cast<nanoseconds>(system_clock::now().time_since_epoch()).count();
+}
+
+inline bool operator==(const ondisk_unix_metadata& l, const ondisk_unix_metadata& r) {
+ return std::memcmp(&l, &r, sizeof(l)) == 0;
+}
+
+inline std::ostream& operator<<(std::ostream& os, const ondisk_unix_metadata& metadata) {
+ os << "[" << metadata.perms << ",";
+ os << metadata.uid << ",";
+ os << metadata.gid << ",";
+ os << metadata.btime_ns << ",";
+ os << metadata.mtime_ns << ",";
+ os << metadata.ctime_ns;
+ return os << "]";
+}
+
+inline std::ostream& operator<<(std::ostream& os, const ondisk_next_metadata_cluster& entry) {
+ os << "{" << "cluster_id=" << entry.cluster_id;
+ return os << "}";
+}
+
+inline std::ostream& operator<<(std::ostream& os, const ondisk_create_inode& entry) {
+ os << "{" << "inode=" << entry.inode;
+ os << ", is_directory=" << (int)entry.is_directory;
+ // os << ", metadata=" << entry.metadata;
+ return os << "}";
+}
+
+inline std::ostream& operator<<(std::ostream& os, const ondisk_delete_inode& entry) {
+ os << "{" << "inode=" << entry.inode;
+ return os << "}";
+}
+
+inline std::ostream& operator<<(std::ostream& os, const ondisk_small_write_header& entry) {
+ os << "{" << "inode=" << entry.inode;
+ os << ", offset=" << entry.offset;
+ os << ", length=" << entry.length;
+ // os << ", time_ns=" << entry.time_ns;
+ return os << "}";
+}
+
+inline std::ostream& operator<<(std::ostream& os, const ondisk_medium_write& entry) {
+ os << "{" << "inode=" << entry.inode;
+ os << ", offset=" << entry.offset;
+ os << ", disk_offset=" << entry.disk_offset;
+ os << ", length=" << entry.length;
+ // os << ", time_ns=" << entry.time_ns;
+ return os << "}";
+}
+
+inline std::ostream& operator<<(std::ostream& os, const ondisk_large_write& entry) {
+ os << "{" << "inode=" << entry.inode;
+ os << ", offset=" << entry.offset;
+ os << ", data_cluster=" << entry.data_cluster;
+ // os << ", time_ns=" << entry.time_ns;
+ return os << "}";
+}
+
+inline std::ostream& operator<<(std::ostream& os, const ondisk_large_write_without_mtime& entry) {
+ os << "{" << "inode=" << entry.inode;
+ os << ", offset=" << entry.offset;
+ os << ", data_cluster=" << entry.data_cluster;
+ return os << "}";
+}
+
+inline std::ostream& operator<<(std::ostream& os, const ondisk_truncate& entry) {
+ os << "{" << "inode=" << entry.inode;
+ os << ", size=" << entry.size;
+ // os << ", time_ns=" << entry.time_ns;
+ return os << "}";
+}
+
+inline std::ostream& operator<<(std::ostream& os, const ondisk_add_dir_entry_header& entry) {
+ os << "{" << "dir_inode=" << entry.dir_inode;
+ os << ", entry_inode=" << entry.entry_inode;
+ os << ", entry_name_length=" << entry.entry_name_length;
+ return os << "}";
+}
+
+inline std::ostream& operator<<(std::ostream& os, const ondisk_create_inode_as_dir_entry_header& entry) {
+ os << "{" << "entry_inode=" << entry.entry_inode;
+ os << ", dir_inode=" << entry.dir_inode;
+ os << ", entry_name_length=" << entry.entry_name_length;
+ return os << "}";
+}
+
+inline std::ostream& operator<<(std::ostream& os, const ondisk_delete_dir_entry_header& entry) {
+ os << "{" << "dir_inode=" << entry.dir_inode;
+ os << ", entry_name_length=" << entry.entry_name_length;
+ return os << "}";
+}
+
+inline std::ostream& operator<<(std::ostream& os, const ondisk_delete_inode_and_dir_entry_header& entry) {
+ os << "{" << "inode_to_delete=" << entry.inode_to_delete;
+ os << ", dir_inode=" << entry.dir_inode;
+ os << ", entry_name_length=" << entry.entry_name_length;
+ return os << "}";
+}
+
+inline std::ostream& operator<<(std::ostream& os, const ondisk_small_write& entry) {
+ os << "(header=" << entry.header;
+ os << ", data=" << entry.data;
+ return os << ")";
+}
+
+inline std::ostream& operator<<(std::ostream& os, const ondisk_add_dir_entry& entry) {
+ os << "(header=" << entry.header;
+ os << ", entry_name=" << entry.entry_name;
+ return os << ")";
+}
+
+inline std::ostream& operator<<(std::ostream& os, const ondisk_create_inode_as_dir_entry& entry) {
+ os << "(header=" << entry.header;
+ os << ", entry_name=" << entry.entry_name;
+ return os << ")";
+}
+
+inline std::ostream& operator<<(std::ostream& os, const ondisk_delete_dir_entry& entry) {
+ os << "(header=" << entry.header;
+ os << ", entry_name=" << entry.entry_name;
+ return os << ")";
+}
+
+inline std::ostream& operator<<(std::ostream& os, const ondisk_delete_inode_and_dir_entry& entry) {
+ os << "(header=" << entry.header;
+ os << ", entry_name=" << entry.entry_name;
+ return os << ")";
+}
+
+inline std::ostream& operator<<(std::ostream& os, const mock_metadata_to_disk_buffer::action& action) {
+ std::visit(overloaded {
+ [&os](const mock_metadata_to_disk_buffer::action::append& append) {
+ os << "[append:";
+ std::visit(overloaded {
+ [&os](const ondisk_next_metadata_cluster& entry) {
+ os << "next_metadata_cluster=" << entry;
+ },
+ [&os](const ondisk_create_inode& entry) {
+ os << "create_inode=" << entry;
+ },
+ [&os](const ondisk_delete_inode& entry) {
+ os << "delete_inode=" << entry;
+ },
+ [&os](const ondisk_small_write& entry) {
+ os << "small_write=" << entry;
+ },
+ [&os](const ondisk_medium_write& entry) {
+ os << "medium_write=" << entry;
+ },
+ [&os](const ondisk_large_write& entry) {
+ os << "large_write=" << entry;
+ },
+ [&os](const ondisk_large_write_without_mtime& entry) {
+ os << "large_write_without_mtime=" << entry;
+ },
+ [&os](const ondisk_truncate& entry) {
+ os << "truncate=" << entry;
+ },
+ [&os](const ondisk_add_dir_entry& entry) {
+ os << "add_dir_entry=" << entry;
+ },
+ [&os](const ondisk_create_inode_as_dir_entry& entry) {
+ os << "create_inode_as_dir_entry=" << entry;
+ },
+ [&os](const ondisk_delete_dir_entry& entry) {
+ os << "delete_dir_entry=" << entry;
+ },
+ [&os](const ondisk_delete_inode_and_dir_entry& entry) {
+ os << "delete_inode_and_dir_entry=" << entry;
+ }
+ }, append.entry);
+ os << "]";
+ },
+ [&os](const mock_metadata_to_disk_buffer::action::flush_to_disk&) {
+ os << "[flush]";
+ }
+ }, action.data);
+ return os;
+}
+
+inline std::ostream& operator<<(std::ostream& os, const std::vector<mock_metadata_to_disk_buffer::action>& actions) {
+ for (size_t i = 0; i < actions.size(); ++i) {
+ if (i != 0) {
+ os << '\n';
+ }
+ os << actions[i];
+ }
+ return os;
+}
+
+inline void check_metadata_entries_equal(const mock_metadata_to_disk_buffer::action& given_action,
+ const ondisk_next_metadata_cluster& expected_entry) {
+ BOOST_REQUIRE(mock_metadata_to_disk_buffer::is_append_type<ondisk_next_metadata_cluster>(given_action));
+ auto& given_entry = mock_metadata_to_disk_buffer::get_by_append_type<ondisk_next_metadata_cluster>(given_action);
+ BOOST_CHECK_EQUAL(copy_value(given_entry.cluster_id), copy_value(expected_entry.cluster_id));
+}
+
+inline void check_metadata_entries_equal(const mock_metadata_to_disk_buffer::action& given_action,
+ const ondisk_create_inode& expected_entry) {
+ BOOST_REQUIRE(mock_metadata_to_disk_buffer::is_append_type<ondisk_create_inode>(given_action));
+ auto& given_entry = mock_metadata_to_disk_buffer::get_by_append_type<ondisk_create_inode>(given_action);
+ BOOST_CHECK_EQUAL(copy_value(given_entry.inode), copy_value(expected_entry.inode));
+ BOOST_CHECK_EQUAL(copy_value(given_entry.is_directory), copy_value(expected_entry.is_directory));
+ BOOST_CHECK_EQUAL(copy_value(given_entry.metadata), copy_value(expected_entry.metadata));
+}
+
+inline void check_metadata_entries_equal(const mock_metadata_to_disk_buffer::action& given_action,
+ const ondisk_delete_inode& expected_entry) {
+ BOOST_REQUIRE(mock_metadata_to_disk_buffer::is_append_type<ondisk_delete_inode>(given_action));
+ auto& given_entry = mock_metadata_to_disk_buffer::get_by_append_type<ondisk_delete_inode>(given_action);
+ BOOST_CHECK_EQUAL(copy_value(given_entry.inode), copy_value(expected_entry.inode));
+}
+
+inline void check_metadata_entries_equal(const mock_metadata_to_disk_buffer::action& given_action,
+ const ondisk_small_write& expected_entry) {
+ BOOST_REQUIRE(mock_metadata_to_disk_buffer::is_append_type<ondisk_small_write>(given_action));
+ auto& given_entry = mock_metadata_to_disk_buffer::get_by_append_type<ondisk_small_write>(given_action);
+ BOOST_CHECK_EQUAL(given_entry.data, expected_entry.data);
+ BOOST_CHECK_EQUAL(copy_value(given_entry.header.inode), copy_value(expected_entry.header.inode));
+ BOOST_CHECK_EQUAL(copy_value(given_entry.header.offset), copy_value(expected_entry.header.offset));
+ BOOST_CHECK_EQUAL(copy_value(given_entry.header.length), copy_value(expected_entry.header.length));
+ BOOST_CHECK_GE(copy_value(given_entry.header.time_ns), copy_value(expected_entry.header.time_ns));
+ BOOST_CHECK_LE(copy_value(given_entry.header.time_ns), get_current_time_ns());
+}
+
+inline void check_metadata_entries_equal(const mock_metadata_to_disk_buffer::action& given_action,
+ const ondisk_medium_write& expected_entry) {
+ BOOST_REQUIRE(mock_metadata_to_disk_buffer::is_append_type<ondisk_medium_write>(given_action));
+ auto& given_entry = mock_metadata_to_disk_buffer::get_by_append_type<ondisk_medium_write>(given_action);
+ BOOST_CHECK_EQUAL(copy_value(given_entry.inode), copy_value(expected_entry.inode));
+ BOOST_CHECK_EQUAL(copy_value(given_entry.offset), copy_value(expected_entry.offset));
+ BOOST_CHECK_EQUAL(copy_value(given_entry.disk_offset), copy_value(expected_entry.disk_offset));
+ BOOST_CHECK_EQUAL(copy_value(given_entry.length), copy_value(expected_entry.length));
+ BOOST_CHECK_GE(copy_value(given_entry.time_ns), copy_value(expected_entry.time_ns));
+ BOOST_CHECK_LE(copy_value(given_entry.time_ns), get_current_time_ns());
+}
+
+inline void check_metadata_entries_equal(const mock_metadata_to_disk_buffer::action& given_action,
+ const ondisk_large_write& expected_entry) {
+ BOOST_REQUIRE(mock_metadata_to_disk_buffer::is_append_type<ondisk_large_write>(given_action));
+ auto& given_entry = mock_metadata_to_disk_buffer::get_by_append_type<ondisk_large_write>(given_action);
+ BOOST_CHECK_EQUAL(copy_value(given_entry.inode), copy_value(expected_entry.inode));
+ BOOST_CHECK_EQUAL(copy_value(given_entry.offset), copy_value(expected_entry.offset));
+ BOOST_CHECK_EQUAL(copy_value(given_entry.data_cluster), copy_value(expected_entry.data_cluster));
+ BOOST_CHECK_GE(copy_value(given_entry.time_ns), copy_value(expected_entry.time_ns));
+ BOOST_CHECK_LE(copy_value(given_entry.time_ns), get_current_time_ns());
+}
+
+inline void check_metadata_entries_equal(const mock_metadata_to_disk_buffer::action& given_action,
+ const ondisk_large_write_without_mtime& expected_entry) {
+ BOOST_REQUIRE(mock_metadata_to_disk_buffer::is_append_type<ondisk_large_write_without_mtime>(given_action));
+ auto& given_entry = mock_metadata_to_disk_buffer::get_by_append_type<ondisk_large_write_without_mtime>(given_action);
+ BOOST_CHECK_EQUAL(copy_value(given_entry.inode), copy_value(expected_entry.inode));
+ BOOST_CHECK_EQUAL(copy_value(given_entry.offset), copy_value(expected_entry.offset));
+ BOOST_CHECK_EQUAL(copy_value(given_entry.data_cluster), copy_value(expected_entry.data_cluster));
+}
+
+inline void check_metadata_entries_equal(const mock_metadata_to_disk_buffer::action& given_action,
+ const ondisk_truncate& expected_entry) {
+ BOOST_REQUIRE(mock_metadata_to_disk_buffer::is_append_type<ondisk_truncate>(given_action));
+ auto& given_entry = mock_metadata_to_disk_buffer::get_by_append_type<ondisk_truncate>(given_action);
+ BOOST_CHECK_EQUAL(copy_value(given_entry.inode), copy_value(expected_entry.inode));
+ BOOST_CHECK_EQUAL(copy_value(given_entry.size), copy_value(expected_entry.size));
+ BOOST_CHECK_GE(copy_value(given_entry.time_ns), copy_value(expected_entry.time_ns));
+ BOOST_CHECK_LE(copy_value(given_entry.time_ns), get_current_time_ns());
+}
+
+inline void check_metadata_entries_equal(const mock_metadata_to_disk_buffer::action& given_action,
+ const ondisk_add_dir_entry& expected_entry) {
+ BOOST_REQUIRE(mock_metadata_to_disk_buffer::is_append_type<ondisk_add_dir_entry>(given_action));
+ auto& given_entry = mock_metadata_to_disk_buffer::get_by_append_type<ondisk_add_dir_entry>(given_action);
+ BOOST_CHECK_EQUAL(given_entry.entry_name, expected_entry.entry_name);
+ BOOST_CHECK_EQUAL(copy_value(given_entry.header.dir_inode), copy_value(expected_entry.header.dir_inode));
+ BOOST_CHECK_EQUAL(copy_value(given_entry.header.entry_inode), copy_value(expected_entry.header.entry_inode));
+ BOOST_CHECK_EQUAL(copy_value(given_entry.header.entry_name_length), copy_value(expected_entry.header.entry_name_length));
+}
+
+inline void check_metadata_entries_equal(const mock_metadata_to_disk_buffer::action& given_action,
+ const ondisk_create_inode_as_dir_entry& expected_entry) {
+ BOOST_REQUIRE(mock_metadata_to_disk_buffer::is_append_type<ondisk_create_inode_as_dir_entry>(given_action));
+ auto& given_entry = mock_metadata_to_disk_buffer::get_by_append_type<ondisk_create_inode_as_dir_entry>(given_action);
+ BOOST_CHECK_EQUAL(given_entry.entry_name, expected_entry.entry_name);
+ BOOST_CHECK_EQUAL(copy_value(given_entry.header.entry_inode.inode), copy_value(expected_entry.header.entry_inode.inode));
+ BOOST_CHECK_EQUAL(copy_value(given_entry.header.entry_inode.is_directory),
+ copy_value(expected_entry.header.entry_inode.is_directory));
+ BOOST_CHECK_EQUAL(copy_value(given_entry.header.entry_inode.metadata),
+ copy_value(expected_entry.header.entry_inode.metadata));
+ BOOST_CHECK_EQUAL(copy_value(given_entry.header.dir_inode), copy_value(expected_entry.header.dir_inode));
+ BOOST_CHECK_EQUAL(copy_value(given_entry.header.entry_name_length), copy_value(expected_entry.header.entry_name_length));
+}
+
+inline void check_metadata_entries_equal(const mock_metadata_to_disk_buffer::action& given_action,
+ const ondisk_delete_dir_entry& expected_entry) {
+ BOOST_REQUIRE(mock_metadata_to_disk_buffer::is_append_type<ondisk_delete_dir_entry>(given_action));
+ auto& given_entry = mock_metadata_to_disk_buffer::get_by_append_type<ondisk_delete_dir_entry>(given_action);
+ BOOST_CHECK_EQUAL(given_entry.entry_name, expected_entry.entry_name);
+ BOOST_CHECK_EQUAL(copy_value(given_entry.header.dir_inode), copy_value(expected_entry.header.dir_inode));
+ BOOST_CHECK_EQUAL(copy_value(given_entry.header.entry_name_length), copy_value(expected_entry.header.entry_name_length));
+}
+
+inline void check_metadata_entries_equal(const mock_metadata_to_disk_buffer::action& given_action,
+ const ondisk_delete_inode_and_dir_entry& expected_entry) {
+ BOOST_REQUIRE(mock_metadata_to_disk_buffer::is_append_type<ondisk_delete_inode_and_dir_entry>(given_action));
+ auto& given_entry = mock_metadata_to_disk_buffer::get_by_append_type<ondisk_delete_inode_and_dir_entry>(given_action);
+ BOOST_CHECK_EQUAL(given_entry.entry_name, expected_entry.entry_name);
+ BOOST_CHECK_EQUAL(copy_value(given_entry.header.inode_to_delete), copy_value(expected_entry.header.inode_to_delete));
+ BOOST_CHECK_EQUAL(copy_value(given_entry.header.dir_inode), copy_value(expected_entry.header.dir_inode));
+ BOOST_CHECK_EQUAL(copy_value(given_entry.header.entry_name_length), copy_value(expected_entry.header.entry_name_length));
+}
+
+} // seastar::fs
diff --git a/tests/unit/fs_mock_cluster_writer.hh b/tests/unit/fs_mock_cluster_writer.hh
new file mode 100644
index 00000000..ea95e006
--- /dev/null
+++ b/tests/unit/fs_mock_cluster_writer.hh
@@ -0,0 +1,78 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/cluster_writer.hh"
+
+#include <seastar/core/shared_ptr.hh>
+#include <seastar/core/temporary_buffer.hh>
+#include <seastar/fs/block_device.hh>
+
+#include <cassert>
+#include <cstdint>
+#include <vector>
+
+namespace seastar::fs {
+
+class mock_cluster_writer : public cluster_writer {
+public:
+ mock_cluster_writer() = default;
+
+ // A container with all the writers created by virtual_constructor
+ inline static thread_local std::vector<shared_ptr<mock_cluster_writer>> virtually_constructed_writers;
+
+ shared_ptr<cluster_writer> virtual_constructor() const override {
+ auto new_writer = make_shared<mock_cluster_writer>();
+ virtually_constructed_writers.emplace_back(new_writer);
+ return new_writer;
+ }
+
+ struct write_to_device {
+ disk_offset_t disk_offset;
+ temporary_buffer<uint8_t> data;
+ };
+
+ std::vector<write_to_device> writes;
+
+ using cluster_writer::init;
+ using cluster_writer::bytes_left;
+ using cluster_writer::current_disk_offset;
+
+ future<size_t> write(const void* aligned_buffer, size_t aligned_len, block_device device) override {
+ assert(reinterpret_cast<uintptr_t>(aligned_buffer) % _alignment == 0);
+ assert(aligned_len % _alignment == 0);
+ assert(aligned_len <= bytes_left());
+
+ writes.emplace_back(write_to_device {
+ _cluster_beg_offset + _next_write_offset,
+ temporary_buffer<uint8_t>(static_cast<const uint8_t*>(aligned_buffer), aligned_len)
+ });
+
+ size_t curr_write_offset = _next_write_offset;
+ _next_write_offset += aligned_len;
+
+ return device.write(_cluster_beg_offset + curr_write_offset, aligned_buffer, aligned_len);
+ }
+
+};
+
+} // namespace seastar::fs
diff --git a/tests/unit/fs_mock_metadata_to_disk_buffer.hh b/tests/unit/fs_mock_metadata_to_disk_buffer.hh
new file mode 100644
index 00000000..a1215f27
--- /dev/null
+++ b/tests/unit/fs_mock_metadata_to_disk_buffer.hh
@@ -0,0 +1,323 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/metadata_to_disk_buffer.hh"
+
+#include <seastar/core/temporary_buffer.hh>
+#include <seastar/fs/overloaded.hh>
+
+#include <cstdint>
+#include <variant>
+#include <vector>
+
+namespace seastar::fs {
+
+struct ondisk_small_write {
+ ondisk_small_write_header header;
+ temporary_buffer<uint8_t> data;
+};
+
+struct ondisk_add_dir_entry {
+ ondisk_add_dir_entry_header header;
+ temporary_buffer<uint8_t> entry_name;
+};
+
+struct ondisk_create_inode_as_dir_entry {
+ ondisk_create_inode_as_dir_entry_header header;
+ temporary_buffer<uint8_t> entry_name;
+};
+
+struct ondisk_delete_dir_entry {
+ ondisk_delete_dir_entry_header header;
+ temporary_buffer<uint8_t> entry_name;
+};
+
+struct ondisk_delete_inode_and_dir_entry {
+ ondisk_delete_inode_and_dir_entry_header header;
+ temporary_buffer<uint8_t> entry_name;
+};
+
+class mock_metadata_to_disk_buffer : public metadata_to_disk_buffer {
+public:
+ mock_metadata_to_disk_buffer() = default;
+
+ // A container with all the buffers created by virtual_constructor
+ inline static thread_local std::vector<shared_ptr<mock_metadata_to_disk_buffer>> virtually_constructed_buffers;
+
+ shared_ptr<metadata_to_disk_buffer> virtual_constructor() const override {
+ auto new_buffer = make_shared<mock_metadata_to_disk_buffer>();
+ virtually_constructed_buffers.emplace_back(new_buffer);
+ return new_buffer;
+ }
+
+ struct action {
+ struct append {
+ using entry_data = std::variant<
+ ondisk_next_metadata_cluster,
+ ondisk_create_inode,
+ ondisk_delete_inode,
+ ondisk_small_write,
+ ondisk_medium_write,
+ ondisk_large_write,
+ ondisk_large_write_without_mtime,
+ ondisk_truncate,
+ ondisk_add_dir_entry,
+ ondisk_create_inode_as_dir_entry,
+ ondisk_delete_dir_entry,
+ ondisk_delete_inode_and_dir_entry>;
+
+ entry_data entry;
+ };
+
+ struct flush_to_disk {};
+
+ using action_data = std::variant<append, flush_to_disk>;
+
+ action_data data;
+
+ action(action_data data) : data(std::move(data)) {}
+ };
+ std::vector<action> actions;
+
+ template<typename T>
+ static bool is_type(const action& a) {
+ return std::holds_alternative<T>(a.data);
+ }
+
+ template<typename T>
+ static const T& get_by_type(const action& a) {
+ return std::get<T>(a.data);
+ }
+
+ template<typename T>
+ static bool is_append_type(const action& a) {
+ return is_type<action::append>(a) and std::holds_alternative<T>(get_by_type<action::append>(a).entry);
+ }
+
+ template<typename T>
+ static const T& get_by_append_type(const action& a) {
+ return std::get<T>(get_by_type<action::append>(a).entry);
+ }
+
+ void init(size_t aligned_max_size, unit_size_t alignment, disk_offset_t cluster_beg_offset) override {
+ assert(is_power_of_2(alignment));
+ assert(mod_by_power_of_2(aligned_max_size, alignment) == 0);
+ assert(mod_by_power_of_2(cluster_beg_offset, alignment) == 0);
+ _max_size = aligned_max_size;
+ _alignment = alignment;
+ _cluster_beg_offset = cluster_beg_offset;
+ _unflushed_data = {0, 0};
+ start_new_unflushed_data();
+ }
+
+ void init_from_bootstrapped_cluster(size_t aligned_max_size, unit_size_t alignment,
+ disk_offset_t cluster_beg_offset, size_t metadata_end_pos) noexcept override {
+ assert(is_power_of_2(alignment));
+ assert(mod_by_power_of_2(aligned_max_size, alignment) == 0);
+ assert(mod_by_power_of_2(cluster_beg_offset, alignment) == 0);
+
+ assert(aligned_max_size >= sizeof(ondisk_type) + sizeof(ondisk_checkpoint));
+ assert(alignment >= sizeof(ondisk_type) + sizeof(ondisk_checkpoint) + sizeof(ondisk_type) +
+ sizeof(ondisk_next_metadata_cluster) and
+ "We always need to be able to pack at least a checkpoint and next_metadata_cluster entry to the last "
+ "data flush in the cluster");
+ assert(metadata_end_pos < aligned_max_size);
+
+ _max_size = aligned_max_size;
+ _alignment = alignment;
+ _cluster_beg_offset = cluster_beg_offset;
+ auto aligned_pos = round_up_to_multiple_of_power_of_2(metadata_end_pos, _alignment);
+ _unflushed_data = {aligned_pos, aligned_pos};
+
+ start_new_unflushed_data();
+ }
+
+ future<> flush_to_disk([[maybe_unused]] block_device device) override {
+ actions.emplace_back(action::flush_to_disk {});
+ prepare_unflushed_data_for_flush();
+
+ assert(mod_by_power_of_2(_unflushed_data.beg, _alignment) == 0);
+ range real_write = {
+ _unflushed_data.beg,
+ round_up_to_multiple_of_power_of_2(_unflushed_data.end, _alignment),
+ };
+
+ // Make sure the buffer is usable before returning from this function
+ _unflushed_data = {real_write.end, real_write.end};
+ start_new_unflushed_data();
+
+ return now();
+ }
+
+private:
+ void mock_append_bytes(size_t len) {
+ assert(len <= bytes_left());
+ _unflushed_data.end += len;
+ }
+
+ append_result mock_append(size_t data_len) {
+ if (not fits_for_append(data_len)) {
+ return TOO_BIG;
+ }
+ mock_append_bytes(data_len);
+ return APPENDED;
+ }
+
+ void start_new_unflushed_data() noexcept override {
+ if (bytes_left() < sizeof(ondisk_type) + sizeof(ondisk_checkpoint) + sizeof(ondisk_type) +
+ sizeof(ondisk_next_metadata_cluster)) {
+ assert(bytes_left() == 0); // alignment has to be big enough to hold checkpoint and next_metadata_cluster
+ return; // No more space
+ }
+ mock_append_bytes(sizeof(ondisk_type) + sizeof(ondisk_checkpoint));
+ }
+
+ void prepare_unflushed_data_for_flush() noexcept override {}
+
+public:
+ using metadata_to_disk_buffer::bytes_left_after_flush_if_done_now;
+ using metadata_to_disk_buffer::bytes_left;
+ using metadata_to_disk_buffer::append_result;
+
+ append_result append(const ondisk_next_metadata_cluster& next_metadata_cluster) noexcept override {
+ size_t len = ondisk_entry_size(next_metadata_cluster);
+ if (bytes_left() < len) {
+ return TOO_BIG;
+ }
+ actions.emplace_back(action::append {ondisk_next_metadata_cluster {next_metadata_cluster}});
+ mock_append_bytes(len);

+ return APPENDED;
+ }
+

+ append_result append(const ondisk_create_inode& create_inode) noexcept override {
+ append_result ret = mock_append(ondisk_entry_size(create_inode));
+ if (ret == APPENDED) {
+ actions.emplace_back(action::append {ondisk_create_inode {create_inode}});

+ }
+ return ret;
+ }
+

+ append_result append(const ondisk_delete_inode& delete_inode) noexcept override {
+ append_result ret = mock_append(ondisk_entry_size(delete_inode));
+ if (ret == APPENDED) {
+ actions.emplace_back(action::append {ondisk_delete_inode {delete_inode}});

+ }
+ return ret;
+ }
+

+ append_result append(const ondisk_medium_write& medium_write) noexcept override {
+ append_result ret = mock_append(ondisk_entry_size(medium_write));
+ if (ret == APPENDED) {
+ actions.emplace_back(action::append {ondisk_medium_write {medium_write}});

+ }
+ return ret;
+ }
+

+ append_result append(const ondisk_large_write& large_write) noexcept override {
+ append_result ret = mock_append(ondisk_entry_size(large_write));
+ if (ret == APPENDED) {
+ actions.emplace_back(action::append {ondisk_large_write {large_write}});

+ }
+ return ret;
+ }
+

+ append_result append(const ondisk_large_write_without_mtime& large_write_without_mtime) noexcept override {
+ append_result ret = mock_append(ondisk_entry_size(large_write_without_mtime));
+ if (ret == APPENDED) {
+ actions.emplace_back(action::append {ondisk_large_write_without_mtime {large_write_without_mtime}});

+ }
+ return ret;
+ }
+

+ append_result append(const ondisk_truncate& truncate) noexcept override {
+ append_result ret = mock_append(ondisk_entry_size(truncate));
+ if (ret == APPENDED) {
+ actions.emplace_back(action::append {ondisk_truncate {truncate}});

+ }
+ return ret;
+ }
+

+private:
+ static temporary_buffer<uint8_t> copy_data(const void* data, size_t length) {
+ return temporary_buffer<uint8_t>(static_cast<const uint8_t*>(data), length);
+ }
+
+public:
+ append_result append(const ondisk_small_write_header& small_write, const void* data) noexcept override {
+ append_result ret = mock_append(ondisk_entry_size(small_write));
+ if (ret == APPENDED) {
+ actions.emplace_back(action::append {ondisk_small_write {
+ small_write,
+ copy_data(data, small_write.length)
+ }});
+ }
+ return ret;
+ }
+
+ append_result append(const ondisk_add_dir_entry_header& add_dir_entry, const void* entry_name) noexcept override {
+ append_result ret = mock_append(ondisk_entry_size(add_dir_entry));
+ if (ret == APPENDED) {
+ actions.emplace_back(action::append {ondisk_add_dir_entry {
+ add_dir_entry,
+ copy_data(entry_name, add_dir_entry.entry_name_length)
+ }});
+ }
+ return ret;
+ }
+
+ append_result append(const ondisk_create_inode_as_dir_entry_header& create_inode_as_dir_entry,
+ const void* entry_name) noexcept override {
+ append_result ret = mock_append(ondisk_entry_size(create_inode_as_dir_entry));
+ if (ret == APPENDED) {
+ actions.emplace_back(action::append {ondisk_create_inode_as_dir_entry {
+ create_inode_as_dir_entry,
+ copy_data(entry_name, create_inode_as_dir_entry.entry_name_length)
+ }});
+ }
+ return ret;
+ }
+
+ append_result append(const ondisk_delete_dir_entry_header& delete_dir_entry, const void* entry_name) noexcept override {
+ append_result ret = mock_append(ondisk_entry_size(delete_dir_entry));
+ if (ret == APPENDED) {
+ actions.emplace_back(action::append {ondisk_delete_dir_entry {
+ delete_dir_entry,
+ copy_data(entry_name, delete_dir_entry.entry_name_length)
+ }});
+ }
+ return ret;
+ }
+
+ append_result append(const ondisk_delete_inode_and_dir_entry_header& delete_inode_and_dir_entry, const void* entry_name) noexcept override {
+ append_result ret = mock_append(ondisk_entry_size(delete_inode_and_dir_entry));
+ if (ret == APPENDED) {
+ actions.emplace_back(action::append {ondisk_delete_inode_and_dir_entry {
+ delete_inode_and_dir_entry,
+ copy_data(entry_name, delete_inode_and_dir_entry.entry_name_length)
+ }});
+ }
+ return ret;
+ }

+};
+
+} // namespace seastar::fs

diff --git a/tests/unit/fs_mock_metadata_to_disk_buffer_test.cc b/tests/unit/fs_mock_metadata_to_disk_buffer_test.cc
new file mode 100644
index 00000000..fbf428a3
--- /dev/null
+++ b/tests/unit/fs_mock_metadata_to_disk_buffer_test.cc
@@ -0,0 +1,357 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+

+#include "fs_metadata_common.hh"
+#include "fs/metadata_disk_entries.hh"
+#include "fs/metadata_log.hh"
+#include "fs/metadata_to_disk_buffer.hh"
+#include "fs_mock_metadata_to_disk_buffer.hh"
+#include "fs_mock_block_device.hh"
+
+#include <seastar/core/print.hh>
+#include <seastar/core/temporary_buffer.hh>
+#include <seastar/core/units.hh>

+#include <seastar/testing/test_case.hh>
+#include <seastar/testing/test_runner.hh>
+#include <seastar/testing/thread_test_case.hh>

+#include <cstdint>

+
+using namespace seastar;
+using namespace seastar::fs;
+

+namespace {
+
+constexpr auto APPENDED = mock_metadata_to_disk_buffer::APPENDED;
+constexpr auto TOO_BIG = mock_metadata_to_disk_buffer::TOO_BIG;
+
+constexpr size_t default_buff_size = 1 * MB;
+constexpr unit_size_t default_alignment = 4 * KB;
+
+template<typename MetadataToDiskBuffer>
+shared_ptr<MetadataToDiskBuffer> create_metadata_buffer() {
+ auto buf = make_shared<MetadataToDiskBuffer>();
+ buf->init(default_buff_size, default_alignment, 0);
+ return buf;
+}
+
+temporary_buffer<uint8_t> tmp_buff_from_string(const char* str) {
+ return temporary_buffer<uint8_t>(reinterpret_cast<const uint8_t*>(str), std::strlen(str));

+}
+
+} // namespace

+
+// The following test checks that an append exceeding the remaining space is handled properly
+SEASTAR_THREAD_TEST_CASE(size_exceeded_test) {
+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());
+ auto mock_buf = create_metadata_buffer<mock_metadata_to_disk_buffer>();
+ auto buf = create_metadata_buffer<metadata_to_disk_buffer>();
+
+ ondisk_delete_inode entry {1};
+ for (;;) {
+ if (buf->append(entry) == APPENDED) {
+ BOOST_REQUIRE_EQUAL(mock_buf->append(entry), APPENDED);
+ } else {
+ BOOST_REQUIRE_EQUAL(mock_buf->append(entry), TOO_BIG);
+ BOOST_REQUIRE_EQUAL(buf->bytes_left(), mock_buf->bytes_left());
+ break;
+ }
+ BOOST_REQUIRE_EQUAL(buf->bytes_left(), mock_buf->bytes_left());
+ }
+}
+
+// The following test checks that multiple actions are correctly recorded in the actions vector
+SEASTAR_THREAD_TEST_CASE(actions_index_test) {
+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());
+ auto mock_buf = create_metadata_buffer<mock_metadata_to_disk_buffer>();

+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+

+ BOOST_REQUIRE_EQUAL(mock_buf->append(ondisk_next_metadata_cluster {1}), APPENDED);
+ mock_buf->flush_to_disk(dev).get();
+ BOOST_REQUIRE_EQUAL(mock_buf->append(ondisk_create_inode {4, 1, {5, 2, 6, 8, 4}}), APPENDED);
+ BOOST_REQUIRE_EQUAL(mock_buf->append(ondisk_delete_inode {1}), APPENDED);
+ BOOST_REQUIRE_EQUAL(mock_buf->append(ondisk_medium_write {1, 8, 4, 6, 9}), APPENDED);
+ mock_buf->flush_to_disk(dev).get();
+ BOOST_REQUIRE_EQUAL(mock_buf->append(ondisk_large_write {6, 8, 2, 5}), APPENDED);
+ BOOST_REQUIRE_EQUAL(mock_buf->append(ondisk_large_write_without_mtime {6, 8, 1}), APPENDED);
+ mock_buf->flush_to_disk(dev).get();
+ BOOST_REQUIRE_EQUAL(mock_buf->append(ondisk_truncate {64, 28, 62}), APPENDED);
+ mock_buf->flush_to_disk(dev).get();
+
+ auto& actions = mock_buf->actions;
+ BOOST_REQUIRE(is_append_type<ondisk_next_metadata_cluster>(actions[0]));
+ BOOST_REQUIRE(is_type<flush_to_disk_action>(actions[1]));
+ BOOST_REQUIRE(is_append_type<ondisk_create_inode>(actions[2]));
+ BOOST_REQUIRE(is_append_type<ondisk_delete_inode>(actions[3]));
+ BOOST_REQUIRE(is_append_type<ondisk_medium_write>(actions[4]));
+ BOOST_REQUIRE(is_type<flush_to_disk_action>(actions[5]));
+ BOOST_REQUIRE(is_append_type<ondisk_large_write>(actions[6]));
+ BOOST_REQUIRE(is_append_type<ondisk_large_write_without_mtime>(actions[7]));
+ BOOST_REQUIRE(is_type<flush_to_disk_action>(actions[8]));
+ BOOST_REQUIRE(is_append_type<ondisk_truncate>(actions[9]));
+ BOOST_REQUIRE(is_type<flush_to_disk_action>(actions[10]));
+}
+
+// The following test checks that constructed buffers are distinct and correctly added to
+// the virtually_constructed_buffers vector
+SEASTAR_THREAD_TEST_CASE(virtual_constructor_test) {
+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());
+ auto& created_buffers = mock_metadata_to_disk_buffer::virtually_constructed_buffers;
+
+ auto mock_buf0 = mock_metadata_to_disk_buffer().virtual_constructor();
+ mock_buf0->init(default_buff_size, default_alignment, 0);
+ BOOST_REQUIRE_EQUAL(created_buffers.size(), 1);
+ BOOST_REQUIRE(created_buffers[0] == mock_buf0);
+
+ auto mock_buf1 = mock_buf0->virtual_constructor();
+ mock_buf1->init(default_buff_size, default_alignment, 0);
+ BOOST_REQUIRE_EQUAL(created_buffers.size(), 2);
+ BOOST_REQUIRE(created_buffers[1] == mock_buf1);
+
+ auto mock_buf2 = mock_buf1->virtual_constructor();
+ mock_buf2->init(default_buff_size, default_alignment, 0);
+ BOOST_REQUIRE_EQUAL(created_buffers.size(), 3);
+ BOOST_REQUIRE(created_buffers[2] == mock_buf2);
+
+
+ BOOST_REQUIRE_EQUAL(created_buffers[0]->actions.size(), 0);
+
+ BOOST_REQUIRE_EQUAL(mock_buf1->append(ondisk_delete_inode {1}), APPENDED);
+ BOOST_REQUIRE_EQUAL(created_buffers[1]->actions.size(), 1);
+
+ BOOST_REQUIRE_EQUAL(mock_buf2->append(ondisk_delete_inode {1}), APPENDED);
+ BOOST_REQUIRE_EQUAL(mock_buf2->append(ondisk_delete_inode {1}), APPENDED);
+ BOOST_REQUIRE_EQUAL(created_buffers[2]->actions.size(), 2);
+}
+
+// The tests below check that each action records the correct information in the actions vector
+
+SEASTAR_THREAD_TEST_CASE(flush_action_test) {
+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());
+ auto mock_buf = create_metadata_buffer<mock_metadata_to_disk_buffer>();

+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+

+ mock_buf->flush_to_disk(dev).get();
+
+ BOOST_REQUIRE(mock_metadata_to_disk_buffer::is_type<flush_to_disk_action>(mock_buf->actions[0]));
+}
+
+SEASTAR_THREAD_TEST_CASE(next_metadata_cluster_test) {
+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());
+ auto mock_buf = create_metadata_buffer<mock_metadata_to_disk_buffer>();
+ auto buf = create_metadata_buffer<metadata_to_disk_buffer>();
+
+ ondisk_next_metadata_cluster next_metadata_cluster_op {6};
+ BOOST_REQUIRE_EQUAL(buf->append(next_metadata_cluster_op), APPENDED);
+ BOOST_REQUIRE_EQUAL(mock_buf->append(next_metadata_cluster_op), APPENDED);
+
+ BOOST_REQUIRE_EQUAL(mock_buf->bytes_left(), buf->bytes_left());
+
+ BOOST_REQUIRE_EQUAL(mock_buf->actions.size(), 1);
+ CHECK_CALL(check_metadata_entries_equal(mock_buf->actions[0], next_metadata_cluster_op));
+}
+
+SEASTAR_THREAD_TEST_CASE(create_inode_test) {
+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());
+ auto mock_buf = create_metadata_buffer<mock_metadata_to_disk_buffer>();
+ auto buf = create_metadata_buffer<metadata_to_disk_buffer>();
+
+ ondisk_create_inode create_inode_op {42, 1, {5, 2, 6, 8, 4}};
+ BOOST_REQUIRE_EQUAL(buf->append(create_inode_op), APPENDED);
+ BOOST_REQUIRE_EQUAL(mock_buf->append(create_inode_op), APPENDED);
+
+ BOOST_REQUIRE_EQUAL(mock_buf->bytes_left(), buf->bytes_left());
+
+ BOOST_REQUIRE_EQUAL(mock_buf->actions.size(), 1);
+ CHECK_CALL(check_metadata_entries_equal(mock_buf->actions[0], create_inode_op));
+}
+
+SEASTAR_THREAD_TEST_CASE(delete_inode_test) {
+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());
+ auto mock_buf = create_metadata_buffer<mock_metadata_to_disk_buffer>();
+ auto buf = create_metadata_buffer<metadata_to_disk_buffer>();
+
+ ondisk_delete_inode delete_inode_op {1};
+ BOOST_REQUIRE_EQUAL(buf->append(delete_inode_op), APPENDED);
+ BOOST_REQUIRE_EQUAL(mock_buf->append(delete_inode_op), APPENDED);
+
+ BOOST_REQUIRE_EQUAL(mock_buf->bytes_left(), buf->bytes_left());
+
+ BOOST_REQUIRE_EQUAL(mock_buf->actions.size(), 1);
+ CHECK_CALL(check_metadata_entries_equal(mock_buf->actions[0], delete_inode_op));
+}
+
+SEASTAR_THREAD_TEST_CASE(small_write_test) {
+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());
+ auto mock_buf = create_metadata_buffer<mock_metadata_to_disk_buffer>();
+ auto buf = create_metadata_buffer<metadata_to_disk_buffer>();
+
+ temporary_buffer<uint8_t> small_write_str = tmp_buff_from_string("12345");
+ ondisk_small_write small_write_op {
+ {
+ 2,
+ 7,
+ static_cast<uint16_t>(small_write_str.size()),
+ 17
+ },
+ small_write_str.share()
+ };
+ BOOST_REQUIRE_EQUAL(buf->append(small_write_op.header, small_write_op.data.get()), APPENDED);
+ BOOST_REQUIRE_EQUAL(mock_buf->append(small_write_op.header, small_write_op.data.get()), APPENDED);
+
+ BOOST_REQUIRE_EQUAL(mock_buf->bytes_left(), buf->bytes_left());
+
+ BOOST_REQUIRE_EQUAL(mock_buf->actions.size(), 1);
+ CHECK_CALL(check_metadata_entries_equal(mock_buf->actions[0], small_write_op));
+}
+
+SEASTAR_THREAD_TEST_CASE(medium_write_test) {
+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());
+ auto mock_buf = create_metadata_buffer<mock_metadata_to_disk_buffer>();
+ auto buf = create_metadata_buffer<metadata_to_disk_buffer>();
+
+ ondisk_medium_write medium_write_op {1, 8, 4, 6, 9};
+ BOOST_REQUIRE_EQUAL(buf->append(medium_write_op), APPENDED);
+ BOOST_REQUIRE_EQUAL(mock_buf->append(medium_write_op), APPENDED);
+
+ BOOST_REQUIRE_EQUAL(mock_buf->bytes_left(), buf->bytes_left());
+
+ BOOST_REQUIRE_EQUAL(mock_buf->actions.size(), 1);
+ CHECK_CALL(check_metadata_entries_equal(mock_buf->actions[0], medium_write_op));
+}
+
+SEASTAR_THREAD_TEST_CASE(large_write_test) {
+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());
+ auto mock_buf = create_metadata_buffer<mock_metadata_to_disk_buffer>();
+ auto buf = create_metadata_buffer<metadata_to_disk_buffer>();
+
+ ondisk_large_write large_write_op {6, 8, 2, 5};
+ BOOST_REQUIRE_EQUAL(buf->append(large_write_op), APPENDED);
+ BOOST_REQUIRE_EQUAL(mock_buf->append(large_write_op), APPENDED);
+
+ BOOST_REQUIRE_EQUAL(mock_buf->bytes_left(), buf->bytes_left());
+
+ BOOST_REQUIRE_EQUAL(mock_buf->actions.size(), 1);
+ CHECK_CALL(check_metadata_entries_equal(mock_buf->actions[0], large_write_op));
+}
+
+SEASTAR_THREAD_TEST_CASE(large_write_without_mtime_test) {
+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());
+ auto mock_buf = create_metadata_buffer<mock_metadata_to_disk_buffer>();
+ auto buf = create_metadata_buffer<metadata_to_disk_buffer>();
+
+ ondisk_large_write_without_mtime large_write_op {256, 88, 11};
+ BOOST_REQUIRE_EQUAL(buf->append(large_write_op), APPENDED);
+ BOOST_REQUIRE_EQUAL(mock_buf->append(large_write_op), APPENDED);
+
+ BOOST_REQUIRE_EQUAL(mock_buf->bytes_left(), buf->bytes_left());
+
+ BOOST_REQUIRE_EQUAL(mock_buf->actions.size(), 1);
+ CHECK_CALL(check_metadata_entries_equal(mock_buf->actions[0], large_write_op));
+}
+
+SEASTAR_THREAD_TEST_CASE(truncate_test) {
+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());
+ auto mock_buf = create_metadata_buffer<mock_metadata_to_disk_buffer>();
+ auto buf = create_metadata_buffer<metadata_to_disk_buffer>();
+
+ ondisk_truncate truncate_op {64, 28, 62};
+ BOOST_REQUIRE_EQUAL(buf->append(truncate_op), APPENDED);
+ BOOST_REQUIRE_EQUAL(mock_buf->append(truncate_op), APPENDED);
+
+ BOOST_REQUIRE_EQUAL(mock_buf->bytes_left(), buf->bytes_left());
+
+ BOOST_REQUIRE_EQUAL(mock_buf->actions.size(), 1);
+ CHECK_CALL(check_metadata_entries_equal(mock_buf->actions[0], truncate_op));
+}
+
+SEASTAR_THREAD_TEST_CASE(add_dir_entry_test) {
+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());
+ auto mock_buf = create_metadata_buffer<mock_metadata_to_disk_buffer>();
+ auto buf = create_metadata_buffer<metadata_to_disk_buffer>();
+
+ temporary_buffer<uint8_t> add_dir_entry_str = tmp_buff_from_string("120345");
+ ondisk_add_dir_entry add_dir_entry_op {
+ {
+ 2,
+ 7,
+ static_cast<uint16_t>(add_dir_entry_str.size())
+ },
+ add_dir_entry_str.share()
+ };
+ BOOST_REQUIRE_EQUAL(buf->append(add_dir_entry_op.header, add_dir_entry_op.entry_name.get()), APPENDED);
+ BOOST_REQUIRE_EQUAL(mock_buf->append(add_dir_entry_op.header, add_dir_entry_op.entry_name.get()), APPENDED);
+
+ BOOST_REQUIRE_EQUAL(mock_buf->bytes_left(), buf->bytes_left());
+
+ BOOST_REQUIRE_EQUAL(mock_buf->actions.size(), 1);
+ CHECK_CALL(check_metadata_entries_equal(mock_buf->actions[0], add_dir_entry_op));
+}
+
+SEASTAR_THREAD_TEST_CASE(create_inode_as_dir_entry_test) {
+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());
+ auto mock_buf = create_metadata_buffer<mock_metadata_to_disk_buffer>();
+ auto buf = create_metadata_buffer<metadata_to_disk_buffer>();
+
+ temporary_buffer<uint8_t> create_inode_str = tmp_buff_from_string("120345");
+ ondisk_create_inode_as_dir_entry create_inode_op {
+ {
+ {
+ 42,
+ 1,
+ {5, 2, 6, 8, 4}
+ },
+ 7,
+ static_cast<uint16_t>(create_inode_str.size())
+ },
+ create_inode_str.share()
+ };
+ BOOST_REQUIRE_EQUAL(buf->append(create_inode_op.header, create_inode_op.entry_name.get()), APPENDED);
+ BOOST_REQUIRE_EQUAL(mock_buf->append(create_inode_op.header, create_inode_op.entry_name.get()), APPENDED);
+
+ BOOST_REQUIRE_EQUAL(mock_buf->bytes_left(), buf->bytes_left());
+
+ BOOST_REQUIRE_EQUAL(mock_buf->actions.size(), 1);
+ CHECK_CALL(check_metadata_entries_equal(mock_buf->actions[0], create_inode_op));
+}
+
+SEASTAR_THREAD_TEST_CASE(delete_dir_entry_test) {
+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());
+ auto mock_buf = create_metadata_buffer<mock_metadata_to_disk_buffer>();
+ auto buf = create_metadata_buffer<metadata_to_disk_buffer>();
+
+ temporary_buffer<uint8_t> delete_entry_str = tmp_buff_from_string("120345");
+ ondisk_delete_dir_entry delete_entry_op {
+ {
+ 42,
+ static_cast<uint16_t>(delete_entry_str.size())
+ },
+ delete_entry_str.share()
+ };
+ BOOST_REQUIRE_EQUAL(buf->append(delete_entry_op.header, delete_entry_op.entry_name.get()), APPENDED);
+ BOOST_REQUIRE_EQUAL(mock_buf->append(delete_entry_op.header, delete_entry_op.entry_name.get()), APPENDED);
+
+ BOOST_REQUIRE_EQUAL(mock_buf->bytes_left(), buf->bytes_left());
+
+ BOOST_REQUIRE_EQUAL(mock_buf->actions.size(), 1);
+ CHECK_CALL(check_metadata_entries_equal(mock_buf->actions[0], delete_entry_op));

+}
diff --git a/tests/unit/CMakeLists.txt b/tests/unit/CMakeLists.txt
index 520546bc..d2480b2e 100644
--- a/tests/unit/CMakeLists.txt
+++ b/tests/unit/CMakeLists.txt
@@ -372,6 +372,10 @@ if (Seastar_EXPERIMENTAL_FS)
seastar_add_test (fs_cluster_allocator
KIND BOOST
SOURCES fs_cluster_allocator_test.cc)
+ seastar_add_test (fs_mock_metadata_to_disk_buffer
+ SOURCES
+ fs_mock_metadata_to_disk_buffer_test.cc
+ fs_mock_block_device.cc)
seastar_add_test (fs_path
KIND BOOST
SOURCES fs_path_test.cc)
--
2.26.1
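
Note for readers of this patch: the mock buffer above only tracks positions, so its flush logic reduces to power-of-two alignment arithmetic. The following is a minimal standalone sketch of that bookkeeping, not the code from the patch. The names is_pow2, mod_pow2, round_up_pow2 and flush_model are hypothetical stand-ins (the real helpers live in fs/bitwise.hh and may be written differently), header_size stands in for sizeof(ondisk_type) + sizeof(ondisk_checkpoint), and the size-limit checks and the "no more space" branch are left out.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative stand-ins for the helpers used by the mock; assumptions, not the fs/bitwise.hh code.
constexpr bool is_pow2(uint64_t x) noexcept {
    return x != 0 && (x & (x - 1)) == 0;
}

// Remainder of x modulo a power of two.
constexpr uint64_t mod_pow2(uint64_t x, uint64_t pow2) noexcept {
    return x & (pow2 - 1);
}

// Smallest multiple of pow2 that is >= x.
constexpr uint64_t round_up_pow2(uint64_t x, uint64_t pow2) noexcept {
    return (x + pow2 - 1) & ~(pow2 - 1);
}

struct range {
    uint64_t beg;
    uint64_t end;
};

// Hypothetical model of the mock's position tracking (no size limits modelled).
struct flush_model {
    uint64_t alignment;
    uint64_t header_size; // stands in for sizeof(ondisk_type) + sizeof(ondisk_checkpoint)
    range unflushed;

    flush_model(uint64_t alignment_, uint64_t header_size_, uint64_t metadata_end_pos)
        : alignment(alignment_), header_size(header_size_), unflushed{} {
        assert(is_pow2(alignment));
        // Start at the first aligned position and reserve room for the checkpoint header.
        uint64_t pos = round_up_pow2(metadata_end_pos, alignment);
        unflushed = {pos, pos + header_size};
    }

    // Account for len appended bytes.
    void append(uint64_t len) {
        unflushed.end += len;
    }

    // A flush writes whole aligned blocks, so the next range starts at the next boundary.
    void flush() {
        assert(mod_pow2(unflushed.beg, alignment) == 0);
        uint64_t next = round_up_pow2(unflushed.end, alignment);
        unflushed = {next, next + header_size};
    }
};
```

With a 4096-byte alignment and a 16-byte header, appending 100 bytes and then flushing moves the start of the unflushed range from 0 to 4096, which is why every flush in the mock consumes at least one aligned block.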

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:46 AM

to seastar-dev@googlegroups.com, Michał Niciejewski, sarna@scylladb.com, ankezy@gmail.com, wmitros@protonmail.com

From: Michał Niciejewski <qup...@gmail.com>

- random tests
- tests for corner cases
  * basic single small writes
  * basic single medium writes
  * basic single large writes
  * new cluster allocation for medium writes
  * medium write split into two smaller writes due to lack of space in
    data-log cluster
  * split single write into more smaller writes because of unaligned
    buffer
  * split big write (bigger than cluster size) into multiple writes

Signed-off-by: Michał Niciejewski <qup...@gmail.com>
---

tests/unit/fs_write_test.cc | 782 ++++++++++++++++++++++++++++++++++++
tests/unit/CMakeLists.txt | 4 +
2 files changed, 786 insertions(+)
create mode 100644 tests/unit/fs_write_test.cc
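
Note for readers: the corner cases listed above all depend on how a write is classified by length. The sketch below mirrors the get_write_type() helper from the test file and adds a deliberately simple splitter for writes larger than a cluster. It is an illustration under stated assumptions, not the SeastarFS implementation: small_write_threshold = 128 is a made-up value (the real SMALL_WRITE_THRESHOLD is defined elsewhere in the series), and split_big_write() is a hypothetical helper that ignores alignment and remaining cluster space.

```cpp
#include <cstddef>
#include <vector>

// Assumed values for illustration only.
constexpr std::size_t small_write_threshold = 128;  // hypothetical, not the real SMALL_WRITE_THRESHOLD
constexpr std::size_t cluster_size = 1024 * 1024;   // same value as default_cluster_size in the test

enum class write_type { SMALL, MEDIUM, LARGE };

// Same rule as get_write_type() in fs_write_test.cc:
// up to the threshold is small, a full cluster or more is large, the rest is medium.
constexpr write_type classify(std::size_t len) noexcept {
    return len <= small_write_threshold ? write_type::SMALL
         : (len >= cluster_size ? write_type::LARGE : write_type::MEDIUM);
}

// Hypothetical splitter: emit whole clusters first, then whatever remains.
// The real code also has to respect buffer alignment and the space left in the data-log cluster.
std::vector<std::size_t> split_big_write(std::size_t len) {
    std::vector<std::size_t> parts;
    while (len >= cluster_size) {
        parts.push_back(cluster_size);
        len -= cluster_size;
    }
    if (len > 0) {
        parts.push_back(len);
    }
    return parts;
}
```

Under these assumptions a write of two clusters plus 10 bytes yields two cluster-sized parts and a 10-byte tail, and the tail on its own would be classified as a small write.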

diff --git a/tests/unit/fs_write_test.cc b/tests/unit/fs_write_test.cc
new file mode 100644
index 00000000..c7e973ae
--- /dev/null
+++ b/tests/unit/fs_write_test.cc
@@ -0,0 +1,782 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+

+
+#include "fs/bitwise.hh"
+#include "fs/cluster.hh"

+#include "fs/metadata_log_operations/write.hh"
+#include "fs/units.hh"
+#include "fs_metadata_common.hh"

+#include "fs_mock_block_device.hh"
+#include "fs_mock_cluster_writer.hh"
+#include "fs_mock_metadata_to_disk_buffer.hh"
+

+#include <seastar/core/temporary_buffer.hh>
+#include <seastar/core/units.hh>
+#include <seastar/testing/thread_test_case.hh>
+
+#include <assert.h>
+#include <cstdint>
+#include <string>

+#include <utility>
+#include <vector>
+

+using namespace seastar;
+using namespace seastar::fs;
+
+namespace {
+

+using resizable_buff_type = basic_sstring<uint8_t, size_t, 32, false>;
+
+constexpr unit_size_t default_cluster_size = 1 * MB;
+constexpr unit_size_t default_alignment = 4096;
+constexpr cluster_range default_cluster_range = {1, 10};
+constexpr cluster_id_t default_metadata_log_cluster = 1;
+constexpr medium_write_len_t min_medium_write_len =
+ round_up_to_multiple_of_power_of_2(SMALL_WRITE_THRESHOLD + 1, default_alignment);
+constexpr size_t random_read_checks_nb = 100;
+
+enum class write_type {
+ SMALL,
+ MEDIUM,
+ LARGE
+};
+
+constexpr write_type get_write_type(size_t len) noexcept {
+ return len <= SMALL_WRITE_THRESHOLD ? write_type::SMALL :
+ (len >= default_cluster_size ? write_type::LARGE : write_type::MEDIUM);
+}
+
+auto default_init_metadata_log() {
+ return init_metadata_log(default_cluster_size, default_alignment, default_metadata_log_cluster, default_cluster_range);
+}
+
+auto default_gen_buffer(size_t len, bool aligned) {
+ return gen_buffer(len, aligned, default_alignment);
+}
+
+void write_with_simulate(metadata_log& log, inode_t inode, file_offset_t write_offset, temporary_buffer<uint8_t>& buff,
+ resizable_buff_type& real_file_data) {
+ if (real_file_data.size() < write_offset + buff.size())
+ real_file_data.resize(write_offset + buff.size());
+ std::memcpy(real_file_data.data() + write_offset, buff.get(), buff.size());
+
+ BOOST_REQUIRE_EQUAL(log.write(inode, write_offset, buff.get(), buff.size()).get0(), buff.size());
+}
+
+void random_write_with_simulate(metadata_log& log, inode_t inode, file_offset_t write_offset, size_t bytes_num, bool aligned,
+ resizable_buff_type& real_file_data) {
+ temporary_buffer<uint8_t> buff = default_gen_buffer(bytes_num, aligned);
+ write_with_simulate(log, inode, write_offset, buff, real_file_data);
+}
+
+void check_random_reads(metadata_log& log, inode_t inode, resizable_buff_type& real_file_data, size_t reps) {
+ size_t file_size = real_file_data.size();

+ std::default_random_engine random_engine(testing::local_random_engine());
+

+ {
+ // Check random reads inside the file
+ std::uniform_int_distribution<file_offset_t> distr(0, file_size - 1);
+ for (size_t rep = 0; rep < reps; ++rep) {
+ auto a = distr(random_engine);
+ auto b = distr(random_engine);
+ if (a > b)
+ std::swap(a, b);
+ size_t max_read_size = b - a + 1;
+ temporary_buffer<uint8_t> read_data(max_read_size);
+ BOOST_REQUIRE_EQUAL(log.read(inode, a, read_data.get_write(), max_read_size).get0(), max_read_size);
+ BOOST_REQUIRE(std::memcmp(real_file_data.c_str() + a, read_data.get(), max_read_size) == 0);
+ }
+ }
+
+ {
+ // Check random reads outside the file
+ std::uniform_int_distribution<file_offset_t> distr(file_size, 2 * file_size);
+ for (size_t rep = 0; rep < reps; ++rep) {
+ auto a = distr(random_engine);
+ auto b = distr(random_engine);
+ if (a > b)
+ std::swap(a, b);
+ size_t max_read_size = b - a + 1;
+ temporary_buffer<uint8_t> read_data(max_read_size);
+ BOOST_REQUIRE_EQUAL(log.read(inode, a, read_data.get_write(), max_read_size).get0(), 0);
+ }
+ }
+
+ {
+ // Check random reads on the edge of the file
+ std::uniform_int_distribution<file_offset_t> distra(0, file_size - 1);
+ std::uniform_int_distribution<file_offset_t> distrb(file_size, 2 * file_size);
+ for (size_t rep = 0; rep < reps; ++rep) {
+ auto a = distra(random_engine);
+ auto b = distrb(random_engine);
+ size_t max_read_size = b - a + 1;
+ size_t expected_read_size = file_size - a;
+ temporary_buffer<uint8_t> read_data(max_read_size);
+ BOOST_REQUIRE_EQUAL(log.read(inode, a, read_data.get_write(), max_read_size).get0(), expected_read_size);
+ BOOST_REQUIRE(std::memcmp(real_file_data.c_str() + a, read_data.get(), expected_read_size) == 0);

+ }
+ }
+
+}
+

+} // namespace

+
+SEASTAR_THREAD_TEST_CASE(small_write_test) {
+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ constexpr file_offset_t write_offset = 7312;
+ const unix_time_t time_ns_start = get_current_time_ns();
+ for (auto write_len : std::vector<small_write_len_t> {
+ 1,
+ 10,
+ SMALL_WRITE_THRESHOLD}) {
+ assert(get_write_type(write_len) == write_type::SMALL);
+
+ auto [blockdev, log] = default_init_metadata_log();
+ inode_t inode = create_and_open_file(log);
+ resizable_buff_type real_file_data;
+
+ BOOST_TEST_MESSAGE("write_len: " << write_len);
+ temporary_buffer<uint8_t> buff = default_gen_buffer(write_len, false);
+ CHECK_CALL(write_with_simulate(log, inode, write_offset, buff, real_file_data));
+
+ auto meta_buff = get_current_metadata_buffer();
+ BOOST_TEST_MESSAGE("meta_buff->actions: " << meta_buff->actions);
+
+ // Check metadata
+ BOOST_REQUIRE_EQUAL(meta_buff->actions.size(), 2);
+ ondisk_small_write expected_entry {
+ ondisk_small_write_header {
+ inode,
+ write_offset,
+ write_len,
+ time_ns_start
+ },
+ buff.share()
+ };
+ CHECK_CALL(check_metadata_entries_equal(meta_buff->actions[1], expected_entry));
+
+ CHECK_CALL(check_random_reads(log, inode, real_file_data, random_read_checks_nb));
+
+ BOOST_TEST_MESSAGE("");
+ }

+}
+
+SEASTAR_THREAD_TEST_CASE(medium_write_test) {
+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ constexpr file_offset_t write_offset = 7331;
+ const unix_time_t time_ns_start = get_current_time_ns();
+ for (auto write_len : std::vector<medium_write_len_t> {
+ min_medium_write_len,
+ default_cluster_size - default_alignment,
+ round_up_to_multiple_of_power_of_2(default_cluster_size / 3 + 10, default_alignment)}) {
+ assert(write_len % default_alignment == 0);
+ assert(get_write_type(write_len) == write_type::MEDIUM);
+
+ auto [blockdev, log] = default_init_metadata_log();
+ inode_t inode = create_and_open_file(log);
+ resizable_buff_type real_file_data;
+
+ BOOST_TEST_MESSAGE("write_len: " << write_len);
+ temporary_buffer<uint8_t> buff = default_gen_buffer(write_len, true);
+ CHECK_CALL(write_with_simulate(log, inode, write_offset, buff, real_file_data));
+
+ auto meta_buff = get_current_metadata_buffer();
+ BOOST_TEST_MESSAGE("meta_buff->actions: " << meta_buff->actions);
+
+ // Check data
+ auto clst_writer = get_current_cluster_writer();
+ BOOST_REQUIRE_EQUAL(clst_writer->writes.size(), 1);
+ auto& clst_writer_ops = clst_writer->writes;
+ BOOST_REQUIRE_EQUAL(clst_writer_ops[0].data, buff);
+
+ // Check metadata
+ BOOST_REQUIRE_EQUAL(meta_buff->actions.size(), 2);
+ ondisk_medium_write expected_entry {
+ inode,
+ write_offset,
+ clst_writer_ops[0].disk_offset,
+ write_len,
+ time_ns_start
+ };
+ CHECK_CALL(check_metadata_entries_equal(meta_buff->actions[1], expected_entry));
+
+ CHECK_CALL(check_random_reads(log, inode, real_file_data, random_read_checks_nb));
+
+ BOOST_TEST_MESSAGE("");
+ }
+}
+
+SEASTAR_THREAD_TEST_CASE(second_medium_write_without_new_data_cluster_allocation_test) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ constexpr file_offset_t write_offset = 1337;
+ const unix_time_t time_ns_start = get_current_time_ns();
+ constexpr medium_write_len_t second_write_len =
+ round_up_to_multiple_of_power_of_2(2 * SMALL_WRITE_THRESHOLD + 1, default_alignment);
+ static_assert(get_write_type(second_write_len) == write_type::MEDIUM);
+ for (auto first_write_len : std::vector<medium_write_len_t> {
+ default_cluster_size - second_write_len - default_alignment,
+ default_cluster_size - second_write_len}) {
+ assert(first_write_len % default_alignment == 0);
+ assert(get_write_type(first_write_len) == write_type::MEDIUM);
+ medium_write_len_t remaining_space_in_cluster = default_cluster_size - first_write_len;
+ assert(remaining_space_in_cluster >= SMALL_WRITE_THRESHOLD);
+ assert(remaining_space_in_cluster >= second_write_len);
+
+ auto [blockdev, log] = default_init_metadata_log();
+ inode_t inode = create_and_open_file(log);
+ resizable_buff_type real_file_data;
+
+ // After this write, remaining_space_in_cluster bytes should remain in the internal cluster_writer
+ BOOST_TEST_MESSAGE("first write len: " << first_write_len);
+ CHECK_CALL(random_write_with_simulate(log, inode, 0, first_write_len, true, real_file_data));
+
+ BOOST_TEST_MESSAGE("second write len: " << second_write_len);
+ size_t nb_of_cluster_writers_before = mock_cluster_writer::virtually_constructed_writers.size();
+ temporary_buffer<uint8_t> buff = default_gen_buffer(second_write_len, true);
+ CHECK_CALL(write_with_simulate(log, inode, write_offset, buff, real_file_data));
+ size_t nb_of_cluster_writers_after = mock_cluster_writer::virtually_constructed_writers.size();
+
+ auto meta_buff = get_current_metadata_buffer();
+ BOOST_TEST_MESSAGE("meta_buff->actions: " << meta_buff->actions);
+
+ // Check cluster writer data
+ BOOST_REQUIRE_EQUAL(nb_of_cluster_writers_before, nb_of_cluster_writers_after);
+ auto clst_writer = get_current_cluster_writer();
+ auto& clst_writer_ops = clst_writer->writes;
+ BOOST_REQUIRE_EQUAL(clst_writer_ops.size(), 2);
+ BOOST_REQUIRE_EQUAL(clst_writer_ops[1].data, buff);
+ BOOST_REQUIRE_EQUAL(clst_writer_ops[1].disk_offset, clst_writer_ops[0].disk_offset + first_write_len);
+
+ // Check metadata to disk buffer entries
+ BOOST_REQUIRE_EQUAL(meta_buff->actions.size(), 3);
+ ondisk_medium_write expected_entry {
+ inode,
+ write_offset,
+ clst_writer_ops[1].disk_offset,
+ second_write_len,
+ time_ns_start
+ };
+ CHECK_CALL(check_metadata_entries_equal(meta_buff->actions[2], expected_entry));
+
+ CHECK_CALL(check_random_reads(log, inode, real_file_data, random_read_checks_nb));
+
+ BOOST_TEST_MESSAGE("");
+ }
+}
+
+SEASTAR_THREAD_TEST_CASE(second_medium_write_with_new_data_cluster_allocation_test) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ constexpr file_offset_t write_offset = 1337;
+ const unix_time_t time_ns_start = get_current_time_ns();
+ constexpr medium_write_len_t second_write_len = min_medium_write_len;
+ static_assert(get_write_type(second_write_len) == write_type::MEDIUM);
+ for (auto first_write_len : std::vector<medium_write_len_t> {
+ default_cluster_size - default_alignment,
+ default_cluster_size - round_down_to_multiple_of_power_of_2(SMALL_WRITE_THRESHOLD, default_alignment)}) {
+ assert(first_write_len % default_alignment == 0);
+ assert(get_write_type(first_write_len) == write_type::MEDIUM);
+ medium_write_len_t remaining_space_in_cluster = default_cluster_size - first_write_len;
+ assert(remaining_space_in_cluster <= SMALL_WRITE_THRESHOLD);
+
+ auto [blockdev, log] = default_init_metadata_log();
+ inode_t inode = create_and_open_file(log);
+ resizable_buff_type real_file_data;
+
+ // After this write, remaining_space_in_cluster bytes should remain in the internal cluster_writer
+ BOOST_TEST_MESSAGE("first write len: " << first_write_len);
+ CHECK_CALL(random_write_with_simulate(log, inode, 0, first_write_len, true, real_file_data));
+
+ BOOST_TEST_MESSAGE("second write len: " << second_write_len);
+ size_t nb_of_cluster_writers_before = mock_cluster_writer::virtually_constructed_writers.size();
+ temporary_buffer<uint8_t> buff = default_gen_buffer(second_write_len, true);
+ CHECK_CALL(write_with_simulate(log, inode, write_offset, buff, real_file_data));
+ size_t nb_of_cluster_writers_after = mock_cluster_writer::virtually_constructed_writers.size();
+
+ auto meta_buff = get_current_metadata_buffer();
+ BOOST_TEST_MESSAGE("meta_buff->actions: " << meta_buff->actions);
+
+ // Check data
+ BOOST_REQUIRE_EQUAL(nb_of_cluster_writers_before + 1, nb_of_cluster_writers_after);
+ auto clst_writer = get_current_cluster_writer();
+ auto& clst_writer_ops = clst_writer->writes;
+ BOOST_REQUIRE_EQUAL(clst_writer_ops.size(), 1);
+ BOOST_REQUIRE_EQUAL(clst_writer_ops.back().data, buff);
+
+ // Check metadata
+ BOOST_REQUIRE_EQUAL(meta_buff->actions.size(), 3);
+ ondisk_medium_write expected_entry {
+ inode,
+ write_offset,
+ clst_writer_ops[0].disk_offset,
+ second_write_len,
+ time_ns_start
+ };
+ CHECK_CALL(check_metadata_entries_equal(meta_buff->actions[2], expected_entry));
+
+ CHECK_CALL(check_random_reads(log, inode, real_file_data, random_read_checks_nb));
+
+ BOOST_TEST_MESSAGE("");
+ }
+}
+
+SEASTAR_THREAD_TEST_CASE(split_medium_write_with_small_write_test) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ constexpr file_offset_t write_offset = 1337;
+ const unix_time_t time_ns_start = get_current_time_ns();
+ for (auto [first_write_len, second_write_len] : std::vector<std::pair<medium_write_len_t, medium_write_len_t>> {
+ {
+ default_cluster_size - min_medium_write_len,
+ min_medium_write_len + SMALL_WRITE_THRESHOLD
+ },
+ {
+ default_cluster_size - 3 * min_medium_write_len,
+ 3 * min_medium_write_len + 1
+ }}) {
+ assert(first_write_len % default_alignment == 0);
+ assert(get_write_type(first_write_len) == write_type::MEDIUM);
+ assert(get_write_type(second_write_len) == write_type::MEDIUM);
+ medium_write_len_t remaining_space_in_cluster = default_cluster_size - first_write_len;
+ assert(remaining_space_in_cluster > SMALL_WRITE_THRESHOLD);
+ assert(remaining_space_in_cluster < second_write_len);
+ assert(get_write_type(second_write_len - remaining_space_in_cluster) == write_type::SMALL);
+
+ auto [blockdev, log] = default_init_metadata_log();
+ inode_t inode = create_and_open_file(log);
+ resizable_buff_type real_file_data;
+
+ // After this write, remaining_space_in_cluster bytes should remain in the internal cluster_writer
+ BOOST_TEST_MESSAGE("first write len: " << first_write_len);
+ CHECK_CALL(random_write_with_simulate(log, inode, 0, first_write_len, true, real_file_data));
+
+ BOOST_TEST_MESSAGE("second write len: " << second_write_len);
+ size_t nb_of_cluster_writers_before = mock_cluster_writer::virtually_constructed_writers.size();
+ temporary_buffer<uint8_t> buff = default_gen_buffer(second_write_len, true);
+ CHECK_CALL(write_with_simulate(log, inode, write_offset, buff, real_file_data));
+ size_t nb_of_cluster_writers_after = mock_cluster_writer::virtually_constructed_writers.size();
+
+ auto meta_buff = get_current_metadata_buffer();
+ BOOST_TEST_MESSAGE("meta_buff->actions: " << meta_buff->actions);
+
+ // Check data
+ BOOST_REQUIRE_EQUAL(nb_of_cluster_writers_before, nb_of_cluster_writers_after);
+ auto clst_writer = get_current_cluster_writer();
+ auto& clst_writer_ops = clst_writer->writes;
+ BOOST_REQUIRE_EQUAL(clst_writer_ops.size(), 2);
+ BOOST_REQUIRE_EQUAL(clst_writer_ops.back().data, buff.share(0, remaining_space_in_cluster));
+
+ // Check metadata
+ BOOST_REQUIRE_EQUAL(meta_buff->actions.size(), 4);
+ ondisk_medium_write expected_entry1 {
+ inode,
+ write_offset,
+ clst_writer_ops[1].disk_offset,
+ remaining_space_in_cluster,
+ time_ns_start
+ };
+ CHECK_CALL(check_metadata_entries_equal(meta_buff->actions[2], expected_entry1));
+ ondisk_small_write expected_entry2 {
+ ondisk_small_write_header {
+ inode,
+ write_offset + remaining_space_in_cluster,
+ static_cast<small_write_len_t>(second_write_len - remaining_space_in_cluster),
+ time_ns_start
+ },
+ buff.share(remaining_space_in_cluster, second_write_len - remaining_space_in_cluster)
+ };
+ CHECK_CALL(check_metadata_entries_equal(meta_buff->actions[3], expected_entry2));
+
+ CHECK_CALL(check_random_reads(log, inode, real_file_data, random_read_checks_nb));
+
+ BOOST_TEST_MESSAGE("");
+ }
+}
+
+SEASTAR_THREAD_TEST_CASE(split_medium_write_with_medium_write_test) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ constexpr file_offset_t write_offset = 1337;
+ const unix_time_t time_ns_start = get_current_time_ns();
+ for (auto [first_write_len, second_write_len] : std::vector<std::pair<medium_write_len_t, medium_write_len_t>> {
+ {
+ default_cluster_size - min_medium_write_len,
+ 2 * min_medium_write_len
+ },
+ {
+ default_cluster_size - min_medium_write_len,
+ default_cluster_size - default_alignment
+ }}) {
+ assert(first_write_len % default_alignment == 0);
+ assert(get_write_type(first_write_len) == write_type::MEDIUM);
+ assert(get_write_type(second_write_len) == write_type::MEDIUM);
+ medium_write_len_t remaining_space_in_cluster = default_cluster_size - first_write_len;
+ assert(remaining_space_in_cluster > SMALL_WRITE_THRESHOLD);
+ assert(remaining_space_in_cluster < second_write_len);
+ assert(get_write_type(second_write_len - remaining_space_in_cluster) == write_type::MEDIUM);
+
+ auto [blockdev, log] = default_init_metadata_log();
+ inode_t inode = create_and_open_file(log);
+ resizable_buff_type real_file_data;
+
+ // After this write, remaining_space_in_cluster bytes should remain in the internal cluster_writer
+ BOOST_TEST_MESSAGE("first write len: " << first_write_len);
+ CHECK_CALL(random_write_with_simulate(log, inode, 0, first_write_len, true, real_file_data));
+
+ BOOST_TEST_MESSAGE("second write len: " << second_write_len);
+ size_t nb_of_cluster_writers_before = mock_cluster_writer::virtually_constructed_writers.size();
+ temporary_buffer<uint8_t> buff = default_gen_buffer(second_write_len, true);
+ CHECK_CALL(write_with_simulate(log, inode, write_offset, buff, real_file_data));
+ size_t nb_of_cluster_writers_after = mock_cluster_writer::virtually_constructed_writers.size();
+
+ auto meta_buff = get_current_metadata_buffer();
+ BOOST_TEST_MESSAGE("meta_buff->actions: " << meta_buff->actions);
+
+ // Check data
+ BOOST_REQUIRE_EQUAL(nb_of_cluster_writers_before + 1, nb_of_cluster_writers_after);
+ auto clst_writer = get_current_cluster_writer();
+ auto& prev_clst_writer =
+ mock_cluster_writer::virtually_constructed_writers[mock_cluster_writer::virtually_constructed_writers.size() - 2];
+ auto& clst_writer_ops = clst_writer->writes;
+ auto& prev_clst_writer_ops = prev_clst_writer->writes;
+ BOOST_REQUIRE_EQUAL(prev_clst_writer_ops.size(), 2);
+ BOOST_REQUIRE_EQUAL(clst_writer_ops.size(), 1);
+ BOOST_REQUIRE_EQUAL(prev_clst_writer_ops.back().data, buff.share(0, remaining_space_in_cluster));
+ BOOST_REQUIRE_EQUAL(clst_writer_ops.back().data,
+ buff.share(remaining_space_in_cluster, second_write_len - remaining_space_in_cluster));
+
+ // Check metadata
+ BOOST_REQUIRE_EQUAL(meta_buff->actions.size(), 4);
+ ondisk_medium_write expected_entry1 {
+ inode,
+ write_offset,
+ prev_clst_writer_ops[1].disk_offset,
+ remaining_space_in_cluster,
+ time_ns_start
+ };
+ CHECK_CALL(check_metadata_entries_equal(meta_buff->actions[2], expected_entry1));
+ ondisk_medium_write expected_entry2 {
+ inode,
+ write_offset + remaining_space_in_cluster,
+ clst_writer_ops[0].disk_offset,
+ second_write_len - remaining_space_in_cluster,
+ time_ns_start
+ };
+ CHECK_CALL(check_metadata_entries_equal(meta_buff->actions[3], expected_entry2));
+
+ CHECK_CALL(check_random_reads(log, inode, real_file_data, random_read_checks_nb));
+
+ BOOST_TEST_MESSAGE("");
+ }

+}
+
+SEASTAR_THREAD_TEST_CASE(large_write_test) {
+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ constexpr file_offset_t write_offset = 7331;
+ const unix_time_t time_ns_start = get_current_time_ns();
+ constexpr uint64_t write_len = default_cluster_size * 2;
+ // TODO: asserts
+
+ auto [blockdev, log] = default_init_metadata_log();
+ inode_t inode = create_and_open_file(log);
+ resizable_buff_type real_file_data;
+
+ BOOST_TEST_MESSAGE("write_len: " << write_len);
+ temporary_buffer<uint8_t> buff = default_gen_buffer(write_len, true);
+ CHECK_CALL(write_with_simulate(log, inode, write_offset, buff, real_file_data));
+
+ auto meta_buff = get_current_metadata_buffer();
+ BOOST_TEST_MESSAGE("meta_buff->actions: " << meta_buff->actions);
+
+ // Check data
+ auto& blockdev_ops = blockdev->writes;
+ BOOST_REQUIRE_EQUAL(blockdev_ops.size(), 2);
+ BOOST_REQUIRE_EQUAL(blockdev_ops[0].disk_offset % default_cluster_size, 0);
+ BOOST_REQUIRE_EQUAL(blockdev_ops[1].disk_offset % default_cluster_size, 0);
+ cluster_id_t part1_cluster_id = blockdev_ops[0].disk_offset / default_cluster_size;
+ cluster_id_t part2_cluster_id = blockdev_ops[1].disk_offset / default_cluster_size;
+ BOOST_REQUIRE_EQUAL(blockdev_ops[0].data, buff.share(0, default_cluster_size));
+ BOOST_REQUIRE_EQUAL(blockdev_ops[1].data, buff.share(default_cluster_size, default_cluster_size));
+
+ // Check metadata
+ BOOST_REQUIRE_EQUAL(meta_buff->actions.size(), 3);
+ ondisk_large_write expected_entry1 {
+ inode,
+ write_offset,
+ part1_cluster_id,
+ time_ns_start
+ };
+ CHECK_CALL(check_metadata_entries_equal(meta_buff->actions[1], expected_entry1));
+ ondisk_large_write_without_mtime expected_entry2 {
+ inode,
+ write_offset + default_cluster_size,
+ part2_cluster_id
+ };
+ CHECK_CALL(check_metadata_entries_equal(meta_buff->actions[2], expected_entry2));
+
+ CHECK_CALL(check_random_reads(log, inode, real_file_data, random_read_checks_nb));
+
+ BOOST_TEST_MESSAGE("");
+}
+
+SEASTAR_THREAD_TEST_CASE(unaligned_write_split_into_two_small_writes_test) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ constexpr file_offset_t write_offset = 7331;
+ const unix_time_t time_ns_start = get_current_time_ns();
+
+ // medium write split into two small writes
+ constexpr medium_write_len_t write_len = SMALL_WRITE_THRESHOLD + 1;
+ // TODO: asserts
+
+ auto [blockdev, log] = default_init_metadata_log();
+ inode_t inode = create_and_open_file(log);
+ resizable_buff_type real_file_data;
+
+ BOOST_TEST_MESSAGE("write_len: " << write_len);
+ temporary_buffer<uint8_t> buff = default_gen_buffer(write_len, false);
+ CHECK_CALL(write_with_simulate(log, inode, write_offset, buff, real_file_data));
+
+ auto meta_buff = get_current_metadata_buffer();
+ BOOST_TEST_MESSAGE("meta_buff->actions: " << meta_buff->actions);
+
+ auto misalignment = reinterpret_cast<uintptr_t>(buff.get()) % default_alignment;
+ small_write_len_t part1_write_len = default_alignment - misalignment;
+ small_write_len_t part2_write_len = write_len - part1_write_len;
+
+ // Check metadata
+ BOOST_REQUIRE_EQUAL(meta_buff->actions.size(), 3);
+ ondisk_small_write expected_entry1 {
+ ondisk_small_write_header {
+ inode,
+ write_offset,
+ part1_write_len,
+ time_ns_start
+ },
+ buff.share(0, part1_write_len)
+ };
+ CHECK_CALL(check_metadata_entries_equal(meta_buff->actions[1], expected_entry1));
+ ondisk_small_write expected_entry2 {
+ ondisk_small_write_header {
+ inode,
+ write_offset + part1_write_len,
+ part2_write_len,
+ time_ns_start
+ },
+ buff.share(part1_write_len, part2_write_len)
+ };
+ CHECK_CALL(check_metadata_entries_equal(meta_buff->actions[2], expected_entry2));
+
+ CHECK_CALL(check_random_reads(log, inode, real_file_data, random_read_checks_nb));
+
+ BOOST_TEST_MESSAGE("");
+}
+
+SEASTAR_THREAD_TEST_CASE(unaligned_write_split_into_small_medium_and_small_writes_test) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ constexpr file_offset_t write_offset = 7331;
+ const unix_time_t time_ns_start = get_current_time_ns();
+ for (auto write_len : std::vector<unit_size_t> {
+ default_cluster_size - default_alignment,
+ default_cluster_size}) {
+ // TODO: asserts
+
+ auto [blockdev, log] = default_init_metadata_log();
+ inode_t inode = create_and_open_file(log);
+ resizable_buff_type real_file_data;
+
+ BOOST_TEST_MESSAGE("write_len: " << write_len);
+ temporary_buffer<uint8_t> buff = default_gen_buffer(write_len, false);
+ CHECK_CALL(write_with_simulate(log, inode, write_offset, buff, real_file_data));
+
+ auto meta_buff = get_current_metadata_buffer();
+ BOOST_TEST_MESSAGE("meta_buff->actions: " << meta_buff->actions);
+
+ auto misalignment = reinterpret_cast<uintptr_t>(buff.get()) % default_alignment;
+ small_write_len_t part1_write_len = default_alignment - misalignment;
+ medium_write_len_t part2_write_len = round_down_to_multiple_of_power_of_2(write_len - part1_write_len, default_alignment);
+ small_write_len_t part3_write_len = write_len - part1_write_len - part2_write_len;
+
+ // Check data
+ auto clst_writer = get_current_cluster_writer();
+ auto& clst_writer_ops = clst_writer->writes;
+ BOOST_REQUIRE_EQUAL(clst_writer_ops.size(), 1);
+ BOOST_REQUIRE_EQUAL(clst_writer_ops.back().data, buff.share(part1_write_len, part2_write_len));
+
+ // Check metadata
+ BOOST_REQUIRE_EQUAL(meta_buff->actions.size(), 4);
+ ondisk_small_write expected_entry1 {
+ ondisk_small_write_header {
+ inode,
+ write_offset,
+ part1_write_len,
+ time_ns_start
+ },
+ buff.share(0, part1_write_len)
+ };
+ CHECK_CALL(check_metadata_entries_equal(meta_buff->actions[1], expected_entry1));
+ ondisk_medium_write expected_entry2 {
+ inode,
+ write_offset + part1_write_len,
+ clst_writer_ops.back().disk_offset,
+ part2_write_len,
+ time_ns_start
+ };
+ CHECK_CALL(check_metadata_entries_equal(meta_buff->actions[2], expected_entry2));
+ ondisk_small_write expected_entry3 {
+ ondisk_small_write_header {
+ inode,
+ write_offset + part1_write_len + part2_write_len,
+ part3_write_len,
+ time_ns_start
+ },
+ buff.share(part1_write_len + part2_write_len, part3_write_len)
+ };
+ CHECK_CALL(check_metadata_entries_equal(meta_buff->actions[3], expected_entry3));
+
+ CHECK_CALL(check_random_reads(log, inode, real_file_data, random_read_checks_nb));
+
+ BOOST_TEST_MESSAGE("");
+ }
+}
+
+SEASTAR_THREAD_TEST_CASE(unaligned_write_split_into_small_large_and_small_writes_test) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ constexpr file_offset_t write_offset = 7331;
+ const unix_time_t time_ns_start = get_current_time_ns();
+ constexpr uint64_t write_len = default_cluster_size + 2 * default_alignment;
+ // TODO: asserts
+
+ auto [blockdev, log] = default_init_metadata_log();
+ inode_t inode = create_and_open_file(log);
+ resizable_buff_type real_file_data;
+
+ BOOST_TEST_MESSAGE("write_len: " << write_len);
+ temporary_buffer<uint8_t> buff = default_gen_buffer(write_len, false);
+ CHECK_CALL(write_with_simulate(log, inode, write_offset, buff, real_file_data));
+
+ auto meta_buff = get_current_metadata_buffer();
+ BOOST_TEST_MESSAGE("meta_buff->actions: " << meta_buff->actions);
+
+ auto misalignment = reinterpret_cast<uintptr_t>(buff.get()) % default_alignment;
+ small_write_len_t part1_write_len = default_alignment - misalignment;
+ small_write_len_t part3_write_len = write_len - part1_write_len - default_cluster_size;
+
+ // Check data
+ auto& blockdev_ops = blockdev->writes;
+ BOOST_REQUIRE_EQUAL(blockdev_ops.size(), 1);
+ BOOST_REQUIRE_EQUAL(blockdev_ops.back().disk_offset % default_cluster_size, 0);
+ cluster_id_t part2_cluster_id = blockdev_ops.back().disk_offset / default_cluster_size;
+ BOOST_REQUIRE_EQUAL(blockdev_ops.back().data, buff.share(part1_write_len, default_cluster_size));
+
+ // Check metadata
+ BOOST_REQUIRE_EQUAL(meta_buff->actions.size(), 4);
+ ondisk_small_write expected_entry1 {
+ ondisk_small_write_header {
+ inode,
+ write_offset,
+ part1_write_len,
+ time_ns_start
+ },
+ buff.share(0, part1_write_len)
+ };
+ CHECK_CALL(check_metadata_entries_equal(meta_buff->actions[1], expected_entry1));
+ ondisk_large_write_without_mtime expected_entry2 {
+ inode,
+ write_offset + part1_write_len,
+ part2_cluster_id
+ };
+ CHECK_CALL(check_metadata_entries_equal(meta_buff->actions[2], expected_entry2));
+ ondisk_small_write expected_entry3 {
+ ondisk_small_write_header {
+ inode,
+ write_offset + part1_write_len + default_cluster_size,
+ part3_write_len,
+ time_ns_start
+ },
+ buff.share(part1_write_len + default_cluster_size, part3_write_len)
+ };
+ CHECK_CALL(check_metadata_entries_equal(meta_buff->actions[3], expected_entry3));
+
+ CHECK_CALL(check_random_reads(log, inode, real_file_data, random_read_checks_nb));
+
+ BOOST_TEST_MESSAGE("");
+}
+
+SEASTAR_THREAD_TEST_CASE(big_single_write_splitting_test) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ constexpr file_offset_t write_offset = 7331;
+ constexpr uint64_t write_len = default_cluster_size * 3 + min_medium_write_len + default_alignment + 10;
+
+ auto [blockdev, log] = default_init_metadata_log();
+ inode_t inode = create_and_open_file(log);
+ resizable_buff_type real_file_data;
+
+ BOOST_TEST_MESSAGE("write_len: " << write_len);
+ temporary_buffer<uint8_t> buff = default_gen_buffer(write_len, true);
+ CHECK_CALL(write_with_simulate(log, inode, write_offset, buff, real_file_data));
+
+ auto meta_buff = get_current_metadata_buffer();
+ BOOST_TEST_MESSAGE("meta_buff->actions: " << meta_buff->actions);
+
+ auto& meta_actions = meta_buff->actions;
+ BOOST_REQUIRE_EQUAL(meta_actions.size(), 6);
+ BOOST_REQUIRE(is_append_type<ondisk_large_write>(meta_actions[1]));
+ BOOST_REQUIRE(is_append_type<ondisk_large_write_without_mtime>(meta_actions[2]));
+ BOOST_REQUIRE(is_append_type<ondisk_large_write_without_mtime>(meta_actions[3]));
+ BOOST_REQUIRE(is_append_type<ondisk_medium_write>(meta_actions[4]));
+ BOOST_REQUIRE(is_append_type<ondisk_small_write>(meta_actions[5]));
+
+ CHECK_CALL(check_random_reads(log, inode, real_file_data, random_read_checks_nb));
+
+ BOOST_TEST_MESSAGE("");
+}
+
+SEASTAR_THREAD_TEST_CASE(random_writes_and_reads_test) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ constexpr size_t writes_nb = 300;
+ constexpr size_t random_read_checks_nb_every_write = 30;
+ constexpr unit_size_t cluster_size = 128 * KB;
+ constexpr uint64_t max_file_size = cluster_size * 3;
+ constexpr size_t available_cluster_nb = (max_file_size * writes_nb * 2) / cluster_size;
+ BOOST_TEST_MESSAGE("available_cluster_nb: " << available_cluster_nb
+ << ", cluster_size: " << cluster_size
+ << ", writes_nb: " << writes_nb
+ << ", random_read_checks_nb_every_write: " << random_read_checks_nb_every_write);
+
+ auto [blockdev, log] = init_metadata_log(cluster_size, default_alignment, 1, {1, available_cluster_nb + 1});
+ inode_t inode = create_and_open_file(log);
+ resizable_buff_type real_file_data;
+
+ std::uniform_int_distribution<file_offset_t> offset_distr(0, max_file_size - 1);
+ std::uniform_int_distribution<int> align_distr(0, 1);

+ std::default_random_engine random_engine(testing::local_random_engine());

+ for (size_t rep = 1; rep <= writes_nb; ++rep) {
+ if (rep % (writes_nb / 10) == 0)
+ BOOST_TEST_MESSAGE("rep: " << rep << "/" << writes_nb);
+ auto a = offset_distr(random_engine);
+ auto b = offset_distr(random_engine);
+ if (a > b)
+ std::swap(a, b);
+ size_t write_size = b - a + 1;
+ bool aligned = align_distr(random_engine);
+
+ CHECK_CALL(random_write_with_simulate(log, inode, a, write_size, aligned, real_file_data));
+ CHECK_CALL(check_random_reads(log, inode, real_file_data, random_read_checks_nb_every_write));
+ }

+}
diff --git a/tests/unit/CMakeLists.txt b/tests/unit/CMakeLists.txt
index d2480b2e..277970e8 100644
--- a/tests/unit/CMakeLists.txt
+++ b/tests/unit/CMakeLists.txt
@@ -379,6 +379,10 @@ if (Seastar_EXPERIMENTAL_FS)

seastar_add_test (fs_path
KIND BOOST
SOURCES fs_path_test.cc)

+ seastar_add_test (fs_write
+ SOURCES
+ fs_write_test.cc
+ fs_mock_block_device.cc)
seastar_add_test (fs_seastarfs
SOURCES fs_seastarfs_test.cc)
seastar_add_test (fs_to_disk_buffer
--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:52 AM

to seastar-dev@googlegroups.com, Michał Niciejewski, sarna@scylladb.com, ankezy@gmail.com, wmitros@protonmail.com

From: Michał Niciejewski <qup...@gmail.com>

Optimization for aligned reads. When the on-disk data and the given buffer
are properly aligned, the data is read directly into the buffer given by the
caller instead of being stored in a temporary buffer first.

Add device_reader to perform unaligned reads with caching.

Signed-off-by: Michał Niciejewski <qup...@gmail.com>
---
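Note: the sketch below is a minimal, standalone illustration of the idea
described in the commit message, not the code in this patch. All names
(byte_range, aligned_cover, can_read_directly, copy_overlap) are
hypothetical stand-ins, and the direct-read condition is only an assumption
about what "properly aligned" means here: offset, length and destination
address are all multiples of the alignment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the patch's disk_range: [beg, end) in bytes.
struct byte_range {
    uint64_t beg;
    uint64_t end;
    uint64_t size() const { return end - beg; }
    bool is_empty() const { return end <= beg; }
};

// Both helpers assume that alignment is a power of 2.
constexpr uint64_t round_down(uint64_t x, uint64_t alignment) {
    return x & ~(alignment - 1);
}
constexpr uint64_t round_up(uint64_t x, uint64_t alignment) {
    return round_down(x + alignment - 1, alignment);
}

inline byte_range intersection(byte_range a, byte_range b) {
    return {std::max(a.beg, b.beg), std::min(a.end, b.end)};
}

// Smallest aligned range that covers r; this is what an unaligned read
// has to fetch into a temporary (cached) buffer.
inline byte_range aligned_cover(byte_range r, uint64_t alignment) {
    return {round_down(r.beg, alignment), round_up(r.end, alignment)};
}

// Assumed condition for skipping the temporary buffer: the disk offset,
// the length and the destination address are all aligned.
inline bool can_read_directly(uintptr_t dest_addr, byte_range r, uint64_t alignment) {
    return r.beg % alignment == 0 && r.size() % alignment == 0 && dest_addr % alignment == 0;
}

// Copies the part of the cached data that overlaps dest_range into dest.
// dest[0] corresponds to dest_range.beg, cached[0] to cached_range.beg.
// Returns the number of bytes copied.
inline size_t copy_overlap(uint8_t* dest, byte_range dest_range,
        const std::vector<uint8_t>& cached, byte_range cached_range) {
    assert(cached.size() >= cached_range.size());
    byte_range common = intersection(dest_range, cached_range);
    if (common.is_empty()) {
        return 0;
    }
    std::memcpy(dest + (common.beg - dest_range.beg),
            cached.data() + (common.beg - cached_range.beg), common.size());
    return common.size();
}
```

With an alignment of 4096, a read of [1000, 5000) is covered by the aligned
range [0, 8192), so it would go through the cache, while a read of
[4096, 8192) into an aligned buffer satisfies can_read_directly() and could
skip the temporary buffer.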

src/fs/device_reader.hh | 91 +++++++++++
src/fs/metadata_log_operations/read.hh | 91 +++--------
src/fs/device_reader.cc | 199 +++++++++++++++++++++++++
CMakeLists.txt | 2 +
4 files changed, 312 insertions(+), 71 deletions(-)
create mode 100644 src/fs/device_reader.hh
create mode 100644 src/fs/device_reader.cc

diff --git a/src/fs/device_reader.hh b/src/fs/device_reader.hh
new file mode 100644
index 00000000..e7c486ce
--- /dev/null
+++ b/src/fs/device_reader.hh
@@ -0,0 +1,91 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+

+#pragma once
+
+#include "fs/bitwise.hh"

+#include "fs/range.hh"
+#include "fs/units.hh"
+#include "seastar/core/file.hh"
+#include "seastar/core/future.hh"

+#include "seastar/core/temporary_buffer.hh"
+#include "seastar/fs/block_device.hh"
+

+#include <cstdint>

+
+namespace seastar::fs {
+

+// Reads from unaligned offsets. Keeps a simple cache to optimize consecutive reads.
+class device_reader {
+ block_device _device;
+ unit_size_t _alignment;

+ const io_priority_class& _pc;
+

+ struct disk_data {

+ disk_range _disk_range;
+ temporary_buffer<uint8_t> _data;

+ } _cached_disk_read;
+ // _cached_disk_read keeps the last disk read to speed up the next disk read when it intersects the
+ // previous one
+
+ struct disk_data_raw {
+ disk_range _disk_range;
+ uint8_t* _data;
+
+ void trim_front(size_t count) {
+ _disk_range.beg += count;
+ _data += count;
+ }
+
+ void trim_back(size_t count) {
+ _disk_range.end -= count;
+ }
+
+ uintptr_t data_alignment(unit_size_t alignment) {
+ return mod_by_power_of_2(reinterpret_cast<uintptr_t>(_data), alignment);
+ }
+
+ size_t range_alignment(unit_size_t alignment) {
+ return mod_by_power_of_2(_disk_range.beg, alignment);
+ }
+ };
+
+ size_t copy_intersecting_data(disk_data_raw dest_data, const disk_data& source_data) const;
+
+ disk_data init_aligned_disk_data(disk_range range) const;
+
+ future<size_t> read_to_cache_and_copy_intersection(disk_range range, disk_data_raw dest_data);
+
+ class partial_read_exception : public std::exception {};
+
+ future<size_t> do_read(disk_data_raw disk_data_to_read);
+
+public:
+ device_reader(block_device& device, unit_size_t alignment, const io_priority_class& pc = default_priority_class())
+ : _device(device)
+ , _alignment(alignment)
+ , _pc(pc) {}
+
+ future<size_t> read(uint8_t* buffer, disk_offset_t disk_offset, size_t read_len);
+
+};
+
+} // namespace seastar::fs
\ No newline at end of file
diff --git a/src/fs/metadata_log_operations/read.hh b/src/fs/metadata_log_operations/read.hh
index 33d3545a..91446729 100644
--- a/src/fs/metadata_log_operations/read.hh
+++ b/src/fs/metadata_log_operations/read.hh
@@ -21,15 +21,24 @@

#pragma once

+#include "fs/cluster.hh"
+#include "fs/device_reader.hh"
#include "fs/inode.hh"
#include "fs/inode_info.hh"
-#include "fs/metadata_disk_entries.hh"
#include "fs/metadata_log.hh"
-#include "fs/range.hh"
#include "fs/units.hh"
+#include "seastar/core/do_with.hh"
+#include "seastar/core/file.hh"
#include "seastar/core/future-util.hh"
#include "seastar/core/future.hh"
-#include "seastar/core/temporary_buffer.hh"
+#include "seastar/fs/exceptions.hh"

+#include "seastar/fs/overloaded.hh"
+
+#include <cstdint>

+#include <cstring>
+#include <utility>
+#include <variant>
+#include <vector>

namespace seastar::fs {

@@ -37,11 +46,15 @@ class read_operation {
metadata_log& _metadata_log;
inode_t _inode;
const io_priority_class& _pc;
+ device_reader _disk_reader;

read_operation(metadata_log& metadata_log, inode_t inode, const io_priority_class& pc)

- : _metadata_log(metadata_log), _inode(inode), _pc(pc) {}
+ : _metadata_log(metadata_log)
+ , _inode(inode)
+ , _pc(pc)
+ , _disk_reader(_metadata_log._device, _metadata_log._alignment, _pc) {}

- future<size_t> read(uint8_t* buffer, size_t read_len, file_offset_t file_offset) {
+ future<size_t> read(uint8_t* buffer, file_offset_t file_offset, size_t read_len) {
auto inode_it = _metadata_log._inodes.find(_inode);
if (inode_it == _metadata_log._inodes.end()) {
return make_exception_future<size_t>(invalid_inode_exception());
@@ -95,14 +108,6 @@ class read_operation {
});
}

- struct disk_temp_buffer {
- disk_range _disk_range;
- temporary_buffer<uint8_t> _data;
- };
- // Keep last disk read to accelerate next disk reads in cases where next read is intersecting previous disk read
- // (after alignment)
- disk_temp_buffer _prev_disk_read;
-

future<size_t> do_read(inode_data_vec& data_vec, uint8_t* buffer) {

size_t expected_read_len = data_vec.data_range.size();

@@ -116,63 +121,7 @@ class read_operation {
return make_ready_future<size_t>(expected_read_len);
},
[&](inode_data_vec::on_disk_data& disk_data) {
- // TODO: we can optimize the case when disk_data.device_offset is aligned
-
- // Copies data from source_buffer corresponding to the intersection of dest_disk_range
- // and source_buffer.disk_range into buffer. dest_disk_range.beg corresponds to first byte of buffer
- // Works when dest_disk_range.beg <= source_buffer._disk_range.beg
- auto copy_left_intersecting_data =
- [](uint8_t* buffer, disk_range dest_disk_range, const disk_temp_buffer& source_buffer) -> size_t {
- disk_range intersect = intersection(dest_disk_range, source_buffer._disk_range);
-
- assert((intersect.is_empty() or dest_disk_range.beg >= source_buffer._disk_range.beg) and
- "Beggining of source buffer on disk should be before beggining of destination buffer on disk");
-
- if (not intersect.is_empty()) {
- // We can copy _data from disk_temp_buffer
- disk_offset_t common_data_len = intersect.size();
- disk_offset_t source_data_offset = dest_disk_range.beg - source_buffer._disk_range.beg;
- // TODO: maybe we should split that memcpy to multiple parts because large reads can lead
- // to spikes in latency
- std::memcpy(buffer, source_buffer._data.get() + source_data_offset, common_data_len);
- return common_data_len;
- } else {
- return 0;
- }
- };
-
- disk_range remaining_read_range {
- disk_data.device_offset,
- disk_data.device_offset + expected_read_len
- };
-
- size_t current_read_len = 0;
- if (not _prev_disk_read._data.empty()) {
- current_read_len = copy_left_intersecting_data(buffer, remaining_read_range, _prev_disk_read);
- if (current_read_len == expected_read_len) {
- return make_ready_future<size_t>(expected_read_len);
- }
- remaining_read_range.beg += current_read_len;
- buffer += current_read_len;
- }
-
- disk_temp_buffer new_disk_read;
- new_disk_read._disk_range = {
- round_down_to_multiple_of_power_of_2(remaining_read_range.beg, _metadata_log._alignment),
- round_up_to_multiple_of_power_of_2(remaining_read_range.end, _metadata_log._alignment)
- };
- new_disk_read._data = temporary_buffer<uint8_t>::aligned(_metadata_log._alignment, new_disk_read._disk_range.size());
-
- return _metadata_log._device.read(new_disk_read._disk_range.beg, new_disk_read._data.get_write(),
- new_disk_read._disk_range.size(), _pc).then(
- [this, copy_left_intersecting_data = std::move(copy_left_intersecting_data),
- new_disk_read = std::move(new_disk_read),
- remaining_read_range, buffer, current_read_len](size_t read_len) mutable {
- new_disk_read._disk_range.end = new_disk_read._disk_range.beg + read_len;
- current_read_len += copy_left_intersecting_data(buffer, remaining_read_range, new_disk_read);
- _prev_disk_read = std::move(new_disk_read);
- return current_read_len;
- });
+ return _disk_reader.read(buffer, disk_data.device_offset, expected_read_len);
},
}, data_vec.data_location);
}
@@ -181,7 +130,7 @@ class read_operation {

static future<size_t> perform(metadata_log& metadata_log, inode_t inode, file_offset_t pos, void* buffer,

size_t len, const io_priority_class& pc) {

return do_with(read_operation(metadata_log, inode, pc), [pos, buffer, len](auto& obj) {

- return obj.read(static_cast<uint8_t*>(buffer), len, pos);
+ return obj.read(static_cast<uint8_t*>(buffer), pos, len);
});
}
};
diff --git a/src/fs/device_reader.cc b/src/fs/device_reader.cc
new file mode 100644
index 00000000..4fa357b1
--- /dev/null
+++ b/src/fs/device_reader.cc
@@ -0,0 +1,199 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+

+#include "fs/device_reader.hh"
+#include "seastar/core/future-util.hh"
+#include "seastar/core/shared_ptr.hh"
+
+#include <cassert>
+#include <cstring>

+
+namespace seastar::fs {
+

+size_t device_reader::copy_intersecting_data(device_reader::disk_data_raw dest_data, const disk_data& source_data) const {
+ disk_range intersect = intersection(dest_data._disk_range, source_data._disk_range);
+
+ if (intersect.is_empty()) {

+ return 0;
+ }
+

+ disk_offset_t common_data_len = intersect.size();

+ if (dest_data._disk_range.beg >= source_data._disk_range.beg) {
+ // Source data starts before dest data
+ disk_offset_t source_data_offset = dest_data._disk_range.beg - source_data._disk_range.beg;

+ // TODO: maybe we should split that memcpy to multiple parts because large reads can lead
+ // to spikes in latency

+ std::memcpy(dest_data._data, source_data._data.get() + source_data_offset, common_data_len);
+ } else {
+ // Source data starts after dest data
+ // TODO: not tested yet
+ disk_offset_t dest_data_offset = source_data._disk_range.beg - dest_data._disk_range.beg;
+ std::memcpy(dest_data._data + dest_data_offset, source_data._data.get(), common_data_len);
+ }
+ return common_data_len;
+}
+
+device_reader::disk_data device_reader::init_aligned_disk_data(disk_range range) const {
+ device_reader::disk_data disk_read;
+ disk_read._disk_range = {
+ round_down_to_multiple_of_power_of_2(range.beg, _alignment),
+ round_up_to_multiple_of_power_of_2(range.end, _alignment)
+ };
+ disk_read._data = temporary_buffer<uint8_t>::aligned(_alignment, disk_read._disk_range.size());
+ return disk_read;
+}
+
+future<size_t> device_reader::read_to_cache_and_copy_intersection(disk_range range, device_reader::disk_data_raw dest_data) {
+ _cached_disk_read = init_aligned_disk_data(range);
+ return _device.read(_cached_disk_read._disk_range.beg, _cached_disk_read._data.get_write(),
+ _cached_disk_read._disk_range.size(), _pc).then(
+ [this, dest_data](size_t read_len) {
+ _cached_disk_read._disk_range.end = _cached_disk_read._disk_range.beg + read_len;
+ auto copied_data = copy_intersecting_data(dest_data, _cached_disk_read);
+ return copied_data;
+ });
+}
+
+
+future<size_t> device_reader::read(uint8_t* buffer, disk_offset_t disk_offset, size_t read_len) {
+ device_reader::disk_data_raw disk_data_to_read = {{disk_offset, disk_offset + read_len}, buffer};
+ if (size_t cache_read_len = 0;
+ not _cached_disk_read._data.empty() and
+ (cache_read_len = copy_intersecting_data(disk_data_to_read, _cached_disk_read)) > 0) {
+ if (cache_read_len == read_len) {
+ return make_ready_future<size_t>(read_len);
+ }
+
+ if (_cached_disk_read._disk_range.beg <= disk_data_to_read._disk_range.beg) {
+ // Left sided intersection
+ disk_data_to_read.trim_front(cache_read_len);
+ return do_read(disk_data_to_read).then([cache_read_len](size_t read_len) {
+ return cache_read_len + read_len;
+ });
+ }
+ else if (_cached_disk_read._disk_range.end >= disk_data_to_read._disk_range.end) {
+ // Right sided intersection
+ // TODO: not tested yet
+ disk_data_to_read.trim_back(cache_read_len);
+ return do_read(disk_data_to_read).then(
+ [expected_read_size = disk_data_to_read._disk_range.size(),
+ cache_read_len](size_t read_len) {
+ return read_len + (read_len == expected_read_size ? cache_read_len : 0);
+ });
+ }
+ else {
+ // Middle part intersection
+ // TODO: not tested yet
+ auto left_disk_data = disk_data_to_read;
+ left_disk_data.trim_back(disk_data_to_read._disk_range.end - _cached_disk_read._disk_range.beg);
+ auto right_disk_data = disk_data_to_read;
+ right_disk_data.trim_front(_cached_disk_read._disk_range.end - disk_data_to_read._disk_range.beg);
+
+ auto current_read_len = make_lw_shared<size_t>(0);
+ return do_read(left_disk_data).then(
+ [expected_read_size = left_disk_data._disk_range.size(),
+ cache_read_len, current_read_len](size_t read_len) {
+ *current_read_len += read_len;
+ if (read_len != expected_read_size) {
+ return make_exception_future<>(partial_read_exception());
+ }
+ *current_read_len += cache_read_len;
+ return now();
+ }).then([this, right_disk_data, current_read_len] {
+ return do_read(right_disk_data).then(
+ [expected_read_size = right_disk_data._disk_range.size(),
+ current_read_len](size_t read_len) {
+ *current_read_len += read_len;
+ if (read_len != expected_read_size) {
+ return make_exception_future<size_t>(partial_read_exception());
+ }
+ return make_ready_future<size_t>(*current_read_len);
+ });
+ }).handle_exception_type([current_read_len] (partial_read_exception&) {
+ return *current_read_len;
+ });
+ }
+ } else {
+ return do_read(disk_data_to_read);
+ }
+}
+
+future<size_t> device_reader::do_read(device_reader::disk_data_raw disk_data_to_read) {
+ auto remaining_disk_data = make_lw_shared<device_reader::disk_data_raw>(disk_data_to_read);
+ auto current_read_len = make_lw_shared<size_t>(0);
+
+ if (remaining_disk_data->data_alignment(_alignment) == remaining_disk_data->range_alignment(_alignment)) {
+ disk_range read_range {
+ round_down_to_multiple_of_power_of_2(remaining_disk_data->_disk_range.beg, _alignment),
+ round_up_to_multiple_of_power_of_2(remaining_disk_data->_disk_range.beg, _alignment)
+ };
+ // Read prefix
+ return read_to_cache_and_copy_intersection(read_range, *remaining_disk_data).then(
+ [this, remaining_disk_data, current_read_len, expected_read_len = read_range.size()](size_t copied_len) {
+ remaining_disk_data->trim_front(copied_len);
+ *current_read_len += copied_len;
+ if (_cached_disk_read._disk_range.size() != expected_read_len) {
+ return make_exception_future<>(partial_read_exception());
+ }
+ return now();
+ }).then([this, remaining_disk_data, current_read_len] {
+ // Read middle part (directly into the output buffer)
+ if (remaining_disk_data->_disk_range.is_empty()) {
+ return now();
+ } else {
+ size_t expected_read_len =
+ round_down_to_multiple_of_power_of_2(remaining_disk_data->_disk_range.size(), _alignment);
+ assert(remaining_disk_data->_disk_range.beg % _alignment == 0);
+ return _device.read(remaining_disk_data->_disk_range.beg, remaining_disk_data->_data, expected_read_len, _pc).then(
+ [remaining_disk_data, current_read_len, expected_read_len](size_t read_len) {
+
+ remaining_disk_data->trim_front(read_len);
+ *current_read_len += read_len;
+
+ if (read_len != expected_read_len) {
+ // Partial read - end here
+ return make_exception_future<>(partial_read_exception());
+ }
+
+ return now();
+ });
+ }
+ }).then([this, remaining_disk_data, current_read_len] {
+ if (remaining_disk_data->_disk_range.is_empty()) {
+ return make_ready_future<size_t>(*current_read_len);
+ } else {
+ return read_to_cache_and_copy_intersection(remaining_disk_data->_disk_range, *remaining_disk_data).then(
+ [current_read_len](size_t copied_len) {
+ return *current_read_len + copied_len;
+ });
+ }
+ }).handle_exception_type([current_read_len] (partial_read_exception&) {
+ return *current_read_len;
+ });
+ } else {
+ return read_to_cache_and_copy_intersection(remaining_disk_data->_disk_range, *remaining_disk_data).then(
+ [current_read_len](size_t copied_len) {
+ return *current_read_len + copied_len;
+ });
+ }
+}
+
+} // namespace seastar::fs
\ No newline at end of file
diff --git a/CMakeLists.txt b/CMakeLists.txt
index e4167018..f63bf1a7 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -671,6 +671,8 @@ if (Seastar_EXPERIMENTAL_FS)
src/fs/cluster_allocator.hh
src/fs/cluster_writer.hh
src/fs/crc.hh
+ src/fs/device_reader.cc
+ src/fs/device_reader.hh
src/fs/file.cc
src/fs/inode.hh
src/fs/inode_info.hh
--
2.26.1
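
The cached-read path in device_reader.cc above depends on two small building blocks: rounding a range out to the device alignment, and copying only the bytes that a cached range and a requested range have in common. A minimal standalone sketch of both follows. The names `range`, `round_down_pow2`, `round_up_pow2` and `copy_intersecting` are hypothetical stand-ins chosen for illustration, not the types or helpers from the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a half-open disk range [beg, end).
struct range {
    uint64_t beg;
    uint64_t end; // exclusive
    uint64_t size() const { return end > beg ? end - beg : 0; }
    bool is_empty() const { return end <= beg; }
};

// Common part of two ranges; the result is empty when they do not overlap.
inline range intersection(range a, range b) {
    return {std::max(a.beg, b.beg), std::min(a.end, b.end)};
}

// Rounding helpers; `alignment` is assumed to be a power of two.
inline uint64_t round_down_pow2(uint64_t v, uint64_t alignment) {
    return v & ~(alignment - 1);
}
inline uint64_t round_up_pow2(uint64_t v, uint64_t alignment) {
    return (v + alignment - 1) & ~(alignment - 1);
}

// Copies the bytes covered by both ranges from src into dest and returns
// how many bytes were copied. `dest` points at the byte for dest_range.beg
// and `src` points at the byte for src_range.beg.
inline size_t copy_intersecting(range dest_range, uint8_t* dest,
                                range src_range, const uint8_t* src) {
    range common = intersection(dest_range, src_range);
    if (common.is_empty()) {
        return 0;
    }
    std::memcpy(dest + (common.beg - dest_range.beg),
                src + (common.beg - src_range.beg),
                common.size());
    return common.size();
}
```

Computing both offsets from the start of the intersection handles the "source starts first" and "destination starts first" cases with one memcpy call, which is one possible way to avoid the two separate branches.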

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:52 AM

to seastar-dev@googlegroups.com, Wojciech Mitros, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com

From: Wojciech Mitros <wmi...@protonmail.com>

Checks whether the data written to disk after a truncate is correct,
whether reads from a truncated file are accurate, and whether the file's
metadata is set to the new size.

Signed-off-by: Wojciech Mitros <wmi...@protonmail.com>
---
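
The behaviour these tests expect can be summarised with a small in-memory model: truncating to a smaller size drops the tail, truncating to a larger size appends zeros, and reads are clamped to the current file size. The sketch below is only an illustration of those semantics. `model_file` is a hypothetical type and does not use the metadata_log API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory model of a single file.
struct model_file {
    std::vector<uint8_t> data;

    // Overwrites or extends the file; a gap before `offset` is zero-filled.
    void write(size_t offset, const std::vector<uint8_t>& bytes) {
        if (data.size() < offset + bytes.size()) {
            data.resize(offset + bytes.size(), 0);
        }
        std::copy(bytes.begin(), bytes.end(), data.begin() + offset);
    }

    // Shrinking drops the tail; growing appends zeros.
    void truncate(size_t new_size) { data.resize(new_size, 0); }

    // Returns at most `len` bytes, clamped to the end of the file.
    std::vector<uint8_t> read(size_t offset, size_t len) const {
        if (offset >= data.size()) {
            return {};
        }
        size_t n = std::min(len, data.size() - offset);
        return std::vector<uint8_t>(data.begin() + offset, data.begin() + offset + n);
    }
};
```

With this model, writing 100 bytes at offset 10 and truncating to 77 leaves 67 readable bytes from offset 10, which has the same shape as the check in test_truncate_to_less.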

tests/unit/fs_truncate_test.cc | 171 +++++++++++++++++++++++++++++++++
tests/unit/CMakeLists.txt | 4 +
2 files changed, 175 insertions(+)
create mode 100644 tests/unit/fs_truncate_test.cc

diff --git a/tests/unit/fs_truncate_test.cc b/tests/unit/fs_truncate_test.cc
new file mode 100644
index 00000000..7398d7e0
--- /dev/null
+++ b/tests/unit/fs_truncate_test.cc
@@ -0,0 +1,171 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+

+
+#include "fs/bitwise.hh"
+#include "fs/cluster.hh"

+#include "fs/metadata_log_operations/read.hh"
+#include "fs/metadata_log_operations/truncate.hh"

+#include "fs/metadata_log_operations/write.hh"
+#include "fs/units.hh"
+#include "fs_metadata_common.hh"
+#include "fs_mock_block_device.hh"

+#include "fs_mock_metadata_to_disk_buffer.hh"
+
+#include <seastar/core/temporary_buffer.hh>
+#include <seastar/core/units.hh>
+#include <seastar/testing/thread_test_case.hh>
+
+#include <assert.h>
+#include <cstdint>
+#include <string>
+#include <utility>
+#include <vector>
+
+using namespace seastar;
+using namespace seastar::fs;
+
+namespace {
+

+constexpr unit_size_t default_cluster_size = 1 * MB;
+constexpr unit_size_t default_alignment = 4096;
+constexpr cluster_range default_cluster_range = {1, 10};
+constexpr cluster_id_t default_metadata_log_cluster = 1;

+
+auto default_init_metadata_log() {
+ return init_metadata_log(default_cluster_size, default_alignment, default_metadata_log_cluster, default_cluster_range);

+}
+
+} // namespace

+
+SEASTAR_THREAD_TEST_CASE(test_truncate_exceptions) {

+ auto [blockdev, log] = default_init_metadata_log();
+ inode_t inode = create_and_open_file(log);

+ file_offset_t size = 3123;
+ BOOST_CHECK_THROW(log.truncate(inode+1, size).get0(), invalid_inode_exception);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_empty_truncate) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ const unix_time_t time_ns_start = get_current_time_ns();

+ auto [blockdev, log] = default_init_metadata_log();
+ inode_t inode = create_and_open_file(log);

+ file_offset_t size = 3123;
+ BOOST_TEST_MESSAGE("truncate size: " << size);
+ log.truncate(inode, size).get0();

+ auto meta_buff = get_current_metadata_buffer();
+ BOOST_TEST_MESSAGE("meta_buff->actions: " << meta_buff->actions);
+
+ // Check metadata
+ BOOST_REQUIRE_EQUAL(meta_buff->actions.size(), 2);

+ ondisk_truncate expected_entry {
+ inode,
+ size,

+ time_ns_start
+ };
+ CHECK_CALL(check_metadata_entries_equal(meta_buff->actions[1], expected_entry));
+

+ // Check data
+ temporary_buffer<uint8_t> buff = temporary_buffer<uint8_t>::aligned(default_alignment, round_up_to_multiple_of_power_of_2(size, default_alignment));
+ temporary_buffer<uint8_t> read_buff = temporary_buffer<uint8_t>::aligned(default_alignment, round_up_to_multiple_of_power_of_2(size, default_alignment));
+ memset(buff.get_write(), 0, size);
+ BOOST_REQUIRE_EQUAL(log.read(inode, 0, read_buff.get_write(), size).get0(), size);
+ BOOST_REQUIRE_EQUAL(memcmp(buff.get(), read_buff.get(), size), 0);
+
+ BOOST_REQUIRE_EQUAL(log.file_size(inode), size);

+
+ BOOST_TEST_MESSAGE("");
+}
+

+SEASTAR_THREAD_TEST_CASE(test_truncate_to_less) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ constexpr file_offset_t write_offset = 10;

+ const unix_time_t time_ns_start = get_current_time_ns();

+ small_write_len_t write_len = 3*default_alignment;
+ file_offset_t size = 77;

+
+ auto [blockdev, log] = default_init_metadata_log();
+ inode_t inode = create_and_open_file(log);
+

+ temporary_buffer<uint8_t> buff = temporary_buffer<uint8_t>::aligned(default_alignment, write_len);
+ memset(buff.get_write(), 'a', write_len);
+ BOOST_REQUIRE_EQUAL(log.write(inode, write_offset, buff.get(), write_len).get0(), write_len);

+ auto meta_buff = get_current_metadata_buffer();

+ auto actions_size_after_write = meta_buff->actions.size();
+ log.truncate(inode, size).get0();

+ BOOST_TEST_MESSAGE("meta_buff->actions: " << meta_buff->actions);
+
+ // Check metadata

+ BOOST_REQUIRE_EQUAL(meta_buff->actions.size(), actions_size_after_write+1);
+ ondisk_truncate expected_entry {
+ inode,
+ size,
+ time_ns_start
+ };
+ CHECK_CALL(check_metadata_entries_equal(meta_buff->actions[actions_size_after_write], expected_entry));

+
+ // Check data

+ temporary_buffer<uint8_t> read_buff = temporary_buffer<uint8_t>::aligned(default_alignment, write_len);
+ BOOST_REQUIRE_EQUAL(log.read(inode, write_offset, read_buff.get_write(), size-write_offset).get0(), size-write_offset);
+ BOOST_REQUIRE_EQUAL(memcmp(buff.get(), read_buff.get(), size-write_offset), 0);
+
+ BOOST_REQUIRE_EQUAL(log.file_size(inode), size);

+
+ BOOST_TEST_MESSAGE("");
+}
+

+SEASTAR_THREAD_TEST_CASE(test_truncate_to_more) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ constexpr file_offset_t write_offset = 10;

+ const unix_time_t time_ns_start = get_current_time_ns();

+ small_write_len_t write_len = 3*default_alignment;
+ file_offset_t size = 3*default_alignment+default_alignment/3;

+
+ auto [blockdev, log] = default_init_metadata_log();
+ inode_t inode = create_and_open_file(log);
+

+ temporary_buffer<uint8_t> buff = temporary_buffer<uint8_t>::aligned(default_alignment, write_len+default_alignment);
+ memset(buff.get_write(), 0, size-write_offset);
+ memset(buff.get_write(), 'a', write_len);
+ BOOST_REQUIRE_EQUAL(log.write(inode, write_offset, buff.get(), write_len).get0(), write_len);

+ auto meta_buff = get_current_metadata_buffer();

+ auto actions_size_after_write = meta_buff->actions.size();
+ log.truncate(inode, size).get0();

+ BOOST_TEST_MESSAGE("meta_buff->actions: " << meta_buff->actions);
+
+ // Check metadata

+ BOOST_REQUIRE_EQUAL(meta_buff->actions.size(), actions_size_after_write+1);
+ ondisk_truncate expected_entry {
+ inode,
+ size,
+ time_ns_start
+ };
+ CHECK_CALL(check_metadata_entries_equal(meta_buff->actions[actions_size_after_write], expected_entry));

+
+ // Check data

+ temporary_buffer<uint8_t> read_buff = temporary_buffer<uint8_t>::aligned(default_alignment, write_len+default_alignment);
+ BOOST_REQUIRE_EQUAL(log.read(inode, write_offset, read_buff.get_write(), size-write_offset).get0(), size-write_offset);
+ BOOST_REQUIRE_EQUAL(memcmp(buff.get(), read_buff.get(), size-write_offset), 0);
+
+ BOOST_REQUIRE_EQUAL(log.file_size(inode), size);

+
+ BOOST_TEST_MESSAGE("");
+}

diff --git a/tests/unit/CMakeLists.txt b/tests/unit/CMakeLists.txt
index 277970e8..727a0b4f 100644

--- a/tests/unit/CMakeLists.txt
+++ b/tests/unit/CMakeLists.txt
@@ -379,6 +379,10 @@ if (Seastar_EXPERIMENTAL_FS)
seastar_add_test (fs_path
KIND BOOST
SOURCES fs_path_test.cc)

+ seastar_add_test (fs_truncate
+ SOURCES
+ fs_truncate_test.cc
+ fs_mock_block_device.cc)
seastar_add_test (fs_write
SOURCES
fs_write_test.cc
--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:53 AM

to seastar-dev@googlegroups.com, Michał Niciejewski, sarna@scylladb.com, ankezy@gmail.com, wmitros@protonmail.com

From: Michał Niciejewski <qup...@gmail.com>

Randomized test that checks the optimizations for aligned writes and reads.

Signed-off-by: Michał Niciejewski <qup...@gmail.com>
---
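
The aligned offsets in this test are produced by drawing a random block index and multiplying it by the alignment. A minimal standalone sketch of that idea is shown below. `random_aligned_value` is a hypothetical helper, and the use of std::mt19937_64 is an assumption made for the example (the test itself uses the Seastar testing random engine).

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Returns a multiple of `alignment` drawn uniformly from [min, max].
// Assumes alignment > 0 and that the interval contains at least one
// such multiple.
inline uint64_t random_aligned_value(std::mt19937_64& rng, uint64_t min,
                                     uint64_t max, uint64_t alignment) {
    uint64_t lo = (min + alignment - 1) / alignment; // first aligned block index
    uint64_t hi = max / alignment;                   // last aligned block index
    assert(lo <= hi);
    return std::uniform_int_distribution<uint64_t>(lo, hi)(rng) * alignment;
}
```

Drawing the block index instead of a byte offset keeps the distribution uniform over the aligned positions and needs no retry loop.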

tests/unit/fs_write_test.cc | 81 ++++++++++++++++++++++++++++++-------
1 file changed, 67 insertions(+), 14 deletions(-)

diff --git a/tests/unit/fs_write_test.cc b/tests/unit/fs_write_test.cc
index c7e973ae..618a5e63 100644
--- a/tests/unit/fs_write_test.cc
+++ b/tests/unit/fs_write_test.cc
@@ -88,19 +88,40 @@ void random_write_with_simulate(metadata_log& log, inode_t inode, file_offset_t

write_with_simulate(log, inode, write_offset, buff, real_file_data);
}

-void check_random_reads(metadata_log& log, inode_t inode, resizable_buff_type& real_file_data, size_t reps) {
+void check_random_reads(metadata_log& log, inode_t inode, resizable_buff_type& real_file_data, size_t reps,
+ bool must_be_aligned = false) {
size_t file_size = real_file_data.size();
std::default_random_engine random_engine(testing::local_random_engine());

+ auto generate_random_value = [&](file_offset_t min, file_offset_t max) {
+ return std::uniform_int_distribution<file_offset_t>(min, max)(random_engine);
+ };
+
+ auto generate_aligned_value = [&](file_offset_t min, file_offset_t max) {
+ auto aligned_min = round_up_to_multiple_of_power_of_2(min, default_alignment);
+ auto aligned_max = round_down_to_multiple_of_power_of_2(max, default_alignment);
+ assert(aligned_min != aligned_max);
+ return std::uniform_int_distribution<file_offset_t>(
+ aligned_min / default_alignment,
+ aligned_max / default_alignment
+ )(random_engine) * default_alignment;
+ };
+
+ std::function<file_offset_t(file_offset_t, file_offset_t)> generate_value;
+ if (must_be_aligned) {
+ generate_value = generate_aligned_value;
+ } else {
+ generate_value = generate_random_value;

+ }
+
{
// Check random reads inside the file

- std::uniform_int_distribution<file_offset_t> distr(0, file_size - 1);

for (size_t rep = 0; rep < reps; ++rep) {

- auto a = distr(random_engine);
- auto b = distr(random_engine);
+ auto a = generate_value(0, file_size);
+ auto b = generate_value(0, file_size);
if (a > b)
std::swap(a, b);
- size_t max_read_size = b - a + 1;
+ size_t max_read_size = b - a;
temporary_buffer<uint8_t> read_data(max_read_size);

BOOST_REQUIRE_EQUAL(log.read(inode, a, read_data.get_write(), max_read_size).get0(), max_read_size);

BOOST_REQUIRE(std::memcmp(real_file_data.c_str() + a, read_data.get(), max_read_size) == 0);

@@ -109,13 +130,12 @@ void check_random_reads(metadata_log& log, inode_t inode, resizable_buff_type& r

{

// Check random reads outside the file

- std::uniform_int_distribution<file_offset_t> distr(file_size, 2 * file_size);

for (size_t rep = 0; rep < reps; ++rep) {

- auto a = distr(random_engine);
- auto b = distr(random_engine);
+ auto a = generate_value(file_size, 2 * file_size);
+ auto b = generate_value(file_size, 2 * file_size);
if (a > b)
std::swap(a, b);
- size_t max_read_size = b - a + 1;
+ size_t max_read_size = b - a;
temporary_buffer<uint8_t> read_data(max_read_size);

BOOST_REQUIRE_EQUAL(log.read(inode, a, read_data.get_write(), max_read_size).get0(), 0);
}

@@ -123,12 +143,10 @@ void check_random_reads(metadata_log& log, inode_t inode, resizable_buff_type& r

{

// Check random reads on the edge of the file

- std::uniform_int_distribution<file_offset_t> distra(0, file_size - 1);
- std::uniform_int_distribution<file_offset_t> distrb(file_size, 2 * file_size);

for (size_t rep = 0; rep < reps; ++rep) {

- auto a = distra(random_engine);
- auto b = distrb(random_engine);
- size_t max_read_size = b - a + 1;
+ auto a = generate_value(0, file_size);
+ auto b = generate_value(file_size, 2 * file_size);
+ size_t max_read_size = b - a;

size_t expected_read_size = file_size - a;

temporary_buffer<uint8_t> read_data(max_read_size);

BOOST_REQUIRE_EQUAL(log.read(inode, a, read_data.get_write(), max_read_size).get0(), expected_read_size);

@@ -780,3 +798,38 @@ SEASTAR_THREAD_TEST_CASE(random_writes_and_reads_test) {

CHECK_CALL(check_random_reads(log, inode, real_file_data, random_read_checks_nb_every_write));
}
}
+

+SEASTAR_THREAD_TEST_CASE(aligned_writes_and_reads_test) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ constexpr size_t writes_nb = 300;
+ constexpr size_t random_read_checks_nb_every_write = 30;
+ constexpr unit_size_t cluster_size = 128 * KB;
+ constexpr uint64_t max_file_size = cluster_size * 3;

+ static_assert(max_file_size % default_alignment == 0);

+ constexpr size_t available_cluster_nb = (max_file_size * writes_nb * 2) / cluster_size;
+ BOOST_TEST_MESSAGE("available_cluster_nb: " << available_cluster_nb
+ << ", cluster_size: " << cluster_size
+ << ", writes_nb: " << writes_nb
+ << ", random_read_checks_nb_every_write: " << random_read_checks_nb_every_write);

+

+ auto [blockdev, log] = init_metadata_log(cluster_size, default_alignment, 1, {1, available_cluster_nb + 1});

+ inode_t inode = create_and_open_file(log);

+ resizable_buff_type real_file_data;
+
+ std::uniform_int_distribution<file_offset_t> offset_distr(0, max_file_size / default_alignment);

+ std::default_random_engine random_engine(testing::local_random_engine());
+ for (size_t rep = 1; rep <= writes_nb; ++rep) {
+ if (rep % (writes_nb / 10) == 0)
+ BOOST_TEST_MESSAGE("rep: " << rep << "/" << writes_nb);

+ file_offset_t a, b;
+ do {
+ a = offset_distr(random_engine) * default_alignment;
+ b = offset_distr(random_engine) * default_alignment;

+ if (a > b)
+ std::swap(a, b);

+ } while (a == b);

+ size_t write_size = b - a;

+ CHECK_CALL(random_write_with_simulate(log, inode, a, write_size, true, real_file_data));
+ CHECK_CALL(check_random_reads(log, inode, real_file_data, random_read_checks_nb_every_write, true));
+ }
+}
--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:54 AM

to seastar-dev@googlegroups.com, Aleksander Sorokin, sarna@scylladb.com, quport@gmail.com, wmitros@protonmail.com

From: Aleksander Sorokin <ank...@gmail.com>

Checks that the newly created directories are accessible after bootstrapping.

Signed-off-by: Aleksander Sorokin <ank...@gmail.com>
---
tests/unit/fs_log_bootstrap_test.cc | 86 +++++++++++++++++++++++++++++
tests/unit/CMakeLists.txt | 4 ++
2 files changed, 90 insertions(+)
create mode 100644 tests/unit/fs_log_bootstrap_test.cc

diff --git a/tests/unit/fs_log_bootstrap_test.cc b/tests/unit/fs_log_bootstrap_test.cc
new file mode 100644
index 00000000..fe070979
--- /dev/null
+++ b/tests/unit/fs_log_bootstrap_test.cc
@@ -0,0 +1,86 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+

+#include "fs/bootstrap_record.hh"
+#include "fs/inode.hh"
+#include "fs/metadata_log.hh"

+#include "fs/units.hh"
+#include "fs_mock_block_device.hh"
+

+#include "seastar/core/units.hh"
+#include "seastar/fs/block_device.hh"
+#include "seastar/testing/thread_test_case.hh"
+#include "seastar/util/defer.hh"
+
+using namespace seastar;
+using namespace fs;
+
+constexpr unit_size_t cluster_size = 1 * MB;
+constexpr unit_size_t alignment = 4 * KB;
+constexpr inode_t root_directory = 0;
+
+future<std::set<std::string>> get_entries_from_directory(metadata_log& log, std::string dir_path) {
+ return async([&log, dir_path = std::move(dir_path)] {
+ std::set<std::string> entries;
+ log.iterate_directory(dir_path, [&entries] (const std::string& entry) -> future<stop_iteration> {
+ entries.insert(entry);
+ return make_ready_future<stop_iteration>(stop_iteration::no);
+ }).wait();
+ return entries;
+ });
+}
+
+BOOST_TEST_DONT_PRINT_LOG_VALUE(std::set<std::string>)
+
+SEASTAR_THREAD_TEST_CASE(create_dirs_and_bootstrap_test) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ const bootstrap_record::shard_info shard_info({1, {1, 16}});
+ const std::set<std::string> control_directories = {{"dir1", "dir2", "dir3"}};

+ auto dev_impl = make_shared<mock_block_device_impl>();
+

+ {
+ block_device device(dev_impl);
+ auto log = metadata_log(std::move(device), cluster_size, alignment);
+ const auto close_log = defer([&log]() mutable { log.shutdown().wait(); });
+ log.bootstrap(root_directory, shard_info.metadata_cluster, shard_info.available_clusters, 1, 0).wait();
+
+ int flush_after = 1;
+ for (auto directory: control_directories) {
+ log.create_directory("/" + std::move(directory), file_permissions::default_file_permissions).wait();
+ if (--flush_after == 0) {
+ log.flush_log().wait();
+ }
+ }
+
+ const auto entries = get_entries_from_directory(log, "/").get0();
+ BOOST_REQUIRE_EQUAL(entries, control_directories);
+ }
+
+ {
+ block_device device(dev_impl);
+ auto log = metadata_log(std::move(device), cluster_size, alignment);
+ const auto close_log = defer([&log]() mutable { log.shutdown().wait(); });
+ log.bootstrap(root_directory, shard_info.metadata_cluster, shard_info.available_clusters, 1, 0).wait();
+
+ const auto entries = get_entries_from_directory(log, "/").get0();
+ BOOST_REQUIRE_EQUAL(entries, control_directories);
+ }

+}
diff --git a/tests/unit/CMakeLists.txt b/tests/unit/CMakeLists.txt

index 6ca79de5..ad49b41d 100644
--- a/tests/unit/CMakeLists.txt
+++ b/tests/unit/CMakeLists.txt

@@ -372,6 +372,10 @@ if (Seastar_EXPERIMENTAL_FS)
seastar_add_test (fs_cluster_allocator
KIND BOOST
SOURCES fs_cluster_allocator_test.cc)

+ seastar_add_test (fs_log_bootstrap
+ SOURCES
+ fs_log_bootstrap_test.cc
+ fs_mock_block_device.cc)
seastar_add_test (fs_metadata_to_disk_buffer
SOURCES
fs_metadata_to_disk_buffer_test.cc
--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>

Apr 20, 2020, 8:02:55 AM

to seastar-dev@googlegroups.com, Wojciech Mitros, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com

From: Wojciech Mitros <wmi...@protonmail.com>

For every on-disk entry, check that:
- it is correctly appended to the buffer when it fits
- the buffer returns TOO_BIG when it does not fit
- it is written to disk after a successful append and flush.

Signed-off-by: Wojciech Mitros <wmi...@protonmail.com>
---
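
The contract exercised by these tests (append succeeds while the entry fits, otherwise TOO_BIG is returned and nothing is written) can be illustrated with a small fixed-capacity buffer. This is a simplified sketch under stated assumptions: `bounded_buffer` is a hypothetical class, and it ignores the checkpoint header and the space reserved for the next-metadata-cluster entry that the real buffer accounts for.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

enum class append_result { APPENDED, TOO_BIG };

// Hypothetical fixed-capacity append buffer that mirrors only the
// APPENDED / TOO_BIG contract.
class bounded_buffer {
    std::vector<uint8_t> _data;
    size_t _used = 0;
public:
    explicit bounded_buffer(size_t capacity) : _data(capacity) {}

    append_result append(const void* entry, size_t len) {
        if (len > _data.size() - _used) {
            return append_result::TOO_BIG; // the buffer is left unchanged
        }
        std::memcpy(_data.data() + _used, entry, len);
        _used += len;
        return append_result::APPENDED;
    }

    size_t bytes_used() const { return _used; }
};
```

A rejected append does not consume space, so a smaller entry can still be appended afterwards.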

tests/unit/fs_metadata_to_disk_buffer_test.cc | 462 ++++++++++++++++++
tests/unit/CMakeLists.txt | 4 +
2 files changed, 466 insertions(+)
create mode 100644 tests/unit/fs_metadata_to_disk_buffer_test.cc

diff --git a/tests/unit/fs_metadata_to_disk_buffer_test.cc b/tests/unit/fs_metadata_to_disk_buffer_test.cc
new file mode 100644
index 00000000..ff4b402e
--- /dev/null
+++ b/tests/unit/fs_metadata_to_disk_buffer_test.cc
@@ -0,0 +1,462 @@

+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+

+#include "fs_metadata_common.hh"
+#include "fs/metadata_disk_entries.hh"
+#include "fs/metadata_to_disk_buffer.hh"

+#include "fs_mock_block_device.hh"
+
+#include <cstring>

+#include <seastar/core/units.hh>
+#include <seastar/fs/block_device.hh>

+#include <seastar/testing/test_case.hh>
+#include <seastar/testing/test_runner.hh>
+#include <seastar/testing/thread_test_case.hh>

+
+using namespace seastar;

+using namespace seastar::fs;
+
+namespace {
+

+constexpr auto APPENDED = metadata_to_disk_buffer::append_result::APPENDED;
+constexpr auto TOO_BIG = metadata_to_disk_buffer::append_result::TOO_BIG;
+
+constexpr unit_size_t max_siz = 1*MB;

+constexpr unit_size_t alignment = 4*KB;
+

+constexpr size_t necessary_bytes = sizeof(ondisk_type)+sizeof(ondisk_checkpoint)+sizeof(ondisk_type)+sizeof(ondisk_next_metadata_cluster);

+
+temporary_buffer<uint8_t> tmp_buff_from_string(const char* str) {
+ return temporary_buffer<uint8_t>(reinterpret_cast<const uint8_t*>(str), std::strlen(str));
+}
+

+struct test_data{
+ shared_ptr<mock_block_device_impl> dev_impl;
+ block_device dev;
+ metadata_to_disk_buffer buf;
+ temporary_buffer<uint8_t> tmp_buf;
+ test_data() {
+ dev_impl = make_shared<mock_block_device_impl>();
+ dev = block_device(dev_impl);
+ buf = metadata_to_disk_buffer();
+ buf.init(max_siz, alignment, 0);
+ tmp_buf = temporary_buffer<uint8_t>::aligned(alignment, max_siz);
+ }
+};
+
+ondisk_small_write_header fill_header(uint16_t bytes) {
+ ondisk_small_write_header fill_write {
+ 2,
+ 7,
+ static_cast<uint16_t> (bytes + 1),
+ 17
+ };
+ return fill_write;
+}
+
+constexpr auto checkpoint_type = ondisk_type::CHECKPOINT;
+
+}
+
+SEASTAR_THREAD_TEST_CASE(test_too_big_next_metadata_cluster) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ test_data test{};
+ ondisk_next_metadata_cluster next_metadata_cluster_op {2};
+ size_t bytes_used = sizeof(ondisk_type) + sizeof(ondisk_checkpoint);
+ while (bytes_used + ondisk_entry_size(next_metadata_cluster_op) <= max_siz) {
+ BOOST_REQUIRE_EQUAL(test.buf.append(next_metadata_cluster_op), APPENDED);
+ bytes_used += ondisk_entry_size(next_metadata_cluster_op);
+ }
+ BOOST_REQUIRE_EQUAL(test.buf.append(next_metadata_cluster_op), TOO_BIG);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_next_metadata_cluster) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ test_data test{};
+ ondisk_type next_metadata_cluster_type = NEXT_METADATA_CLUSTER;
+ ondisk_next_metadata_cluster next_metadata_cluster_op {2};
+ BOOST_REQUIRE_EQUAL(test.buf.append(next_metadata_cluster_op), APPENDED);
+ test.buf.flush_to_disk(test.dev).get();
+ disk_offset_t len_aligned = round_up_to_multiple_of_power_of_2(sizeof(ondisk_type) + sizeof(ondisk_next_metadata_cluster), alignment);
+ test.dev.read<uint8_t>(0, test.tmp_buf.get_write(), len_aligned).get();
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get(), &checkpoint_type, sizeof(ondisk_type)), 0);
+ test.tmp_buf.trim_front(sizeof(ondisk_type) + sizeof(ondisk_checkpoint));
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get(), &next_metadata_cluster_type, sizeof(ondisk_type)), 0);
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get() + sizeof(ondisk_type), &next_metadata_cluster_op, sizeof(ondisk_next_metadata_cluster)), 0);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_too_big_create_inode) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ test_data test{};
+ test.buf.init_from_bootstrapped_cluster(max_siz, alignment, 0, max_siz-alignment);

+ ondisk_create_inode create_inode_op {42, 1, {5, 2, 6, 8, 4}};

+ auto fill_write_op = fill_header(alignment - necessary_bytes - sizeof(ondisk_type) - sizeof(ondisk_small_write_header) - ondisk_entry_size(create_inode_op) + 1);
+ auto fill_write_str = temporary_buffer<uint8_t>(fill_write_op.length);
+ std::memset(fill_write_str.get_write(), 'a', fill_write_op.length);
+ BOOST_REQUIRE_EQUAL(test.buf.append(fill_write_op, fill_write_str.get()), APPENDED);
+ BOOST_REQUIRE_EQUAL(test.buf.append(create_inode_op), TOO_BIG);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_create_inode) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ test_data test{};
+ ondisk_type create_inode_type = CREATE_INODE;

+ ondisk_create_inode create_inode_op {42, 1, {5, 2, 6, 8, 4}};

+ BOOST_REQUIRE_EQUAL(test.buf.append(create_inode_op), APPENDED);
+ test.buf.flush_to_disk(test.dev).get();
+ disk_offset_t len_aligned = round_up_to_multiple_of_power_of_2(sizeof(ondisk_type) + sizeof(ondisk_create_inode), alignment);
+ test.dev.read<uint8_t>(0, test.tmp_buf.get_write(), len_aligned).get();
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get(), &checkpoint_type, sizeof(ondisk_type)), 0);
+ test.tmp_buf.trim_front(sizeof(ondisk_type) + sizeof(ondisk_checkpoint));
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get(), &create_inode_type, sizeof(ondisk_type)), 0);
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get() + sizeof(ondisk_type), &create_inode_op, sizeof(ondisk_create_inode)), 0);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_too_big_delete_inode) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ test_data test{};
+ test.buf.init_from_bootstrapped_cluster(max_siz, alignment, 0, max_siz-alignment);
+ ondisk_delete_inode delete_inode_op {1};
+ auto fill_write_op = fill_header(alignment - necessary_bytes - sizeof(ondisk_type) - sizeof(ondisk_small_write_header) - ondisk_entry_size(delete_inode_op) + 1);
+ auto fill_write_str = temporary_buffer<uint8_t>(fill_write_op.length);
+ std::memset(fill_write_str.get_write(), 'a', fill_write_op.length);
+ BOOST_REQUIRE_EQUAL(test.buf.append(fill_write_op, fill_write_str.get()), APPENDED);
+ BOOST_REQUIRE_EQUAL(test.buf.append(delete_inode_op), TOO_BIG);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_delete_inode) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ test_data test{};
+ ondisk_type delete_inode_type = DELETE_INODE;
+ ondisk_delete_inode delete_inode_op {1};
+ BOOST_REQUIRE_EQUAL(test.buf.append(delete_inode_op), APPENDED);
+ test.buf.flush_to_disk(test.dev).get();
+ disk_offset_t len_aligned = round_up_to_multiple_of_power_of_2(sizeof(ondisk_type) + sizeof(ondisk_delete_inode), alignment);
+ test.dev.read<uint8_t>(0, test.tmp_buf.get_write(), len_aligned).get();
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get(), &checkpoint_type, sizeof(ondisk_type)), 0);
+ test.tmp_buf.trim_front(sizeof(ondisk_type) + sizeof(ondisk_checkpoint));
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get(), &delete_inode_type, sizeof(ondisk_type)), 0);
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get() + sizeof(ondisk_type), &delete_inode_op, sizeof(ondisk_delete_inode)), 0);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_too_big_small_write) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ test_data test{};
+ test.buf.init_from_bootstrapped_cluster(max_siz, alignment, 0, max_siz-alignment);
+ ondisk_small_write_header small_write_op {
+ 2,
+ 7,
+ static_cast<uint16_t> (10),
+ 17
+ };
+ auto fill_write_op = fill_header(alignment - necessary_bytes - sizeof(ondisk_type) - sizeof(ondisk_small_write_header) - ondisk_entry_size(small_write_op) + 1);
+ auto fill_write_str = temporary_buffer<uint8_t>(fill_write_op.length);
+ std::memset(fill_write_str.get_write(), 'a', fill_write_op.length);
+ BOOST_REQUIRE_EQUAL(test.buf.append(fill_write_op, fill_write_str.get()), APPENDED);
+ BOOST_REQUIRE_EQUAL(test.buf.append(small_write_op, fill_write_str.get()), TOO_BIG);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_small_write) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ test_data test{};

+ temporary_buffer<uint8_t> small_write_str = tmp_buff_from_string("12345");

+ ondisk_type small_write_type = SMALL_WRITE;
+ ondisk_small_write_header small_write_op {

+ 2,
+ 7,
+ static_cast<uint16_t>(small_write_str.size()),
+ 17

+ };
+ BOOST_REQUIRE_EQUAL(test.buf.append(small_write_op, small_write_str.get()), APPENDED);
+ test.buf.flush_to_disk(test.dev).get();
+ disk_offset_t len_aligned = round_up_to_multiple_of_power_of_2(sizeof(ondisk_type) + sizeof(ondisk_small_write_header) + small_write_str.size(), alignment);
+ test.dev.read<uint8_t>(0, test.tmp_buf.get_write(), len_aligned).get();
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get(), &checkpoint_type, sizeof(ondisk_type)), 0);
+ test.tmp_buf.trim_front(sizeof(ondisk_type) + sizeof(ondisk_checkpoint));
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get(), &small_write_type, sizeof(ondisk_type)), 0);
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get() + sizeof(ondisk_type), &small_write_op, sizeof(ondisk_small_write_header)), 0);
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get() + sizeof(ondisk_type) + sizeof(ondisk_small_write_header), small_write_str.get(), small_write_str.size()), 0);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_too_big_medium_write) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ test_data test{};
+ test.buf.init_from_bootstrapped_cluster(max_siz, alignment, 0, max_siz-alignment);

+ ondisk_medium_write medium_write_op {1, 8, 4, 6, 9};

+ auto fill_write_op = fill_header(alignment - necessary_bytes - sizeof(ondisk_type) - sizeof(ondisk_small_write_header) - ondisk_entry_size(medium_write_op) + 1);
+ auto fill_write_str = temporary_buffer<uint8_t>(fill_write_op.length);
+ std::memset(fill_write_str.get_write(), 'a', fill_write_op.length);
+ BOOST_REQUIRE_EQUAL(test.buf.append(fill_write_op, fill_write_str.get()), APPENDED);
+ BOOST_REQUIRE_EQUAL(test.buf.append(medium_write_op), TOO_BIG);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_medium_write) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ test_data test{};
+ ondisk_type medium_write_type = MEDIUM_WRITE;

+ ondisk_medium_write medium_write_op {1, 8, 4, 6, 9};

+ BOOST_REQUIRE_EQUAL(test.buf.append(medium_write_op), APPENDED);
+ test.buf.flush_to_disk(test.dev).get();
+ disk_offset_t len_aligned = round_up_to_multiple_of_power_of_2(sizeof(ondisk_type) + sizeof(ondisk_medium_write), alignment);
+ test.dev.read<uint8_t>(0, test.tmp_buf.get_write(), len_aligned).get();
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get(), &checkpoint_type, sizeof(ondisk_type)), 0);
+ test.tmp_buf.trim_front(sizeof(ondisk_type) + sizeof(ondisk_checkpoint));
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get(), &medium_write_type, sizeof(ondisk_type)), 0);
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get() + sizeof(ondisk_type), &medium_write_op, sizeof(ondisk_medium_write)), 0);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_too_big_large_write) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ test_data test{};
+ test.buf.init_from_bootstrapped_cluster(max_siz, alignment, 0, max_siz-alignment);

+ ondisk_large_write large_write_op {6, 8, 2, 5};

+ auto fill_write_op = fill_header(alignment - necessary_bytes - sizeof(ondisk_type) - sizeof(ondisk_small_write_header) - ondisk_entry_size(large_write_op) + 1);
+ auto fill_write_str = temporary_buffer<uint8_t>(fill_write_op.length);
+ std::memset(fill_write_str.get_write(), 'a', fill_write_op.length);
+ BOOST_REQUIRE_EQUAL(test.buf.append(fill_write_op, fill_write_str.get()), APPENDED);
+ BOOST_REQUIRE_EQUAL(test.buf.append(large_write_op), TOO_BIG);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_large_write) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ test_data test{};
+ ondisk_type large_write_type = LARGE_WRITE;

+ ondisk_large_write large_write_op {6, 8, 2, 5};

+ BOOST_REQUIRE_EQUAL(test.buf.append(large_write_op), APPENDED);
+ test.buf.flush_to_disk(test.dev).get();
+ disk_offset_t len_aligned = round_up_to_multiple_of_power_of_2(sizeof(ondisk_type) + sizeof(ondisk_large_write), alignment);
+ test.dev.read<uint8_t>(0, test.tmp_buf.get_write(), len_aligned).get();
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get(), &checkpoint_type, sizeof(ondisk_type)), 0);
+ test.tmp_buf.trim_front(sizeof(ondisk_type) + sizeof(ondisk_checkpoint));
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get(), &large_write_type, sizeof(ondisk_type)), 0);
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get() + sizeof(ondisk_type), &large_write_op, sizeof(ondisk_large_write)), 0);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_too_big_large_write_without_mtime) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ test_data test{};
+ test.buf.init_from_bootstrapped_cluster(max_siz, alignment, 0, max_siz-alignment);
+ ondisk_large_write_without_mtime large_write_without_mtime_op {256, 88, 11};
+ auto fill_write_op = fill_header(alignment - necessary_bytes - sizeof(ondisk_type) - sizeof(ondisk_small_write_header) - ondisk_entry_size(large_write_without_mtime_op) + 1);
+ auto fill_write_str = temporary_buffer<uint8_t>(fill_write_op.length);
+ std::memset(fill_write_str.get_write(), 'a', fill_write_op.length);
+ BOOST_REQUIRE_EQUAL(test.buf.append(fill_write_op, fill_write_str.get()), APPENDED);
+ BOOST_REQUIRE_EQUAL(test.buf.append(large_write_without_mtime_op), TOO_BIG);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_large_write_without_mtime) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ test_data test{};
+ ondisk_type large_write_without_mtime_type = LARGE_WRITE_WITHOUT_MTIME;
+ ondisk_large_write_without_mtime large_write_without_mtime_op {256, 88, 11};
+ BOOST_REQUIRE_EQUAL(test.buf.append(large_write_without_mtime_op), APPENDED);
+ test.buf.flush_to_disk(test.dev).get();
+ disk_offset_t len_aligned = round_up_to_multiple_of_power_of_2(sizeof(ondisk_type) + sizeof(ondisk_large_write_without_mtime), alignment);
+ test.dev.read<uint8_t>(0, test.tmp_buf.get_write(), len_aligned).get();
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get(), &checkpoint_type, sizeof(ondisk_type)), 0);
+ test.tmp_buf.trim_front(sizeof(ondisk_type) + sizeof(ondisk_checkpoint));
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get(), &large_write_without_mtime_type, sizeof(ondisk_type)), 0);
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get() + sizeof(ondisk_type), &large_write_without_mtime_op, sizeof(ondisk_large_write_without_mtime)), 0);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_too_big_truncate) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ test_data test{};
+ test.buf.init_from_bootstrapped_cluster(max_siz, alignment, 0, max_siz-alignment);

+ ondisk_truncate truncate_op {64, 28, 62};

+ auto fill_write_op = fill_header(alignment - necessary_bytes - sizeof(ondisk_type) - sizeof(ondisk_small_write_header) - ondisk_entry_size(truncate_op) + 1);
+ auto fill_write_str = temporary_buffer<uint8_t>(fill_write_op.length);
+ std::memset(fill_write_str.get_write(), 'a', fill_write_op.length);
+ BOOST_REQUIRE_EQUAL(test.buf.append(fill_write_op, fill_write_str.get()), APPENDED);
+ BOOST_REQUIRE_EQUAL(test.buf.append(truncate_op), TOO_BIG);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_truncate) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ test_data test{};
+ ondisk_type truncate_type = TRUNCATE;

+ ondisk_truncate truncate_op {64, 28, 62};

+ BOOST_REQUIRE_EQUAL(test.buf.append(truncate_op), APPENDED);
+ test.buf.flush_to_disk(test.dev).get();
+ disk_offset_t len_aligned = round_up_to_multiple_of_power_of_2(sizeof(ondisk_type) + sizeof(ondisk_truncate), alignment);
+ test.dev.read<uint8_t>(0, test.tmp_buf.get_write(), len_aligned).get();
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get(), &checkpoint_type, sizeof(ondisk_type)), 0);
+ test.tmp_buf.trim_front(sizeof(ondisk_type) + sizeof(ondisk_checkpoint));
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get(), &truncate_type, sizeof(ondisk_type)), 0);
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get() + sizeof(ondisk_type), &truncate_op, sizeof(ondisk_truncate)), 0);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_too_big_add_dir_entry) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ test_data test{};
+ test.buf.init_from_bootstrapped_cluster(max_siz, alignment, 0, max_siz-alignment);
+ ondisk_add_dir_entry_header add_dir_entry_op {
+ 2,
+ 7,
+ static_cast<uint16_t>(10)
+ };
+ auto fill_write_op = fill_header(alignment - necessary_bytes - sizeof(ondisk_type) - sizeof(ondisk_small_write_header) - ondisk_entry_size(add_dir_entry_op) + 1);
+ auto fill_write_str = temporary_buffer<uint8_t>(fill_write_op.length);
+ std::memset(fill_write_str.get_write(), 'a', fill_write_op.length);
+ BOOST_REQUIRE_EQUAL(test.buf.append(fill_write_op, fill_write_str.get()), APPENDED);
+ BOOST_REQUIRE_EQUAL(test.buf.append(add_dir_entry_op, fill_write_str.get()), TOO_BIG);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_add_dir_entry) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ test_data test{};

+ temporary_buffer<uint8_t> add_dir_entry_str = tmp_buff_from_string("120345");

+ ondisk_type add_dir_entry_type = ADD_DIR_ENTRY;
+ ondisk_add_dir_entry_header add_dir_entry_op {

+ 2,
+ 7,
+ static_cast<uint16_t>(add_dir_entry_str.size())

+ };
+ BOOST_REQUIRE_EQUAL(test.buf.append(add_dir_entry_op, add_dir_entry_str.get()), APPENDED);
+ test.buf.flush_to_disk(test.dev).get();
+ disk_offset_t len_aligned = round_up_to_multiple_of_power_of_2(sizeof(ondisk_type) + sizeof(ondisk_add_dir_entry_header) + add_dir_entry_str.size(), alignment);
+ test.dev.read<uint8_t>(0, test.tmp_buf.get_write(), len_aligned).get();
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get(), &checkpoint_type, sizeof(ondisk_type)), 0);
+ test.tmp_buf.trim_front(sizeof(ondisk_type) + sizeof(ondisk_checkpoint));
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get(), &add_dir_entry_type, sizeof(ondisk_type)), 0);
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get() + sizeof(ondisk_type), &add_dir_entry_op, sizeof(ondisk_add_dir_entry_header)), 0);
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get() + sizeof(ondisk_type) + sizeof(ondisk_add_dir_entry_header), add_dir_entry_str.get(), add_dir_entry_str.size()), 0);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_too_big_create_inode_as_dir_entry) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ test_data test{};
+ test.buf.init_from_bootstrapped_cluster(max_siz, alignment, 0, max_siz-alignment);
+ ondisk_create_inode_as_dir_entry_header create_inode_as_dir_entry_op {

+ {
+ 42,
+ 1,
+ {5, 2, 6, 8, 4}
+ },
+ 7,

+ static_cast<uint16_t>(10)
+ };
+ auto fill_write_op = fill_header(alignment - necessary_bytes - sizeof(ondisk_type) - sizeof(ondisk_small_write_header) - ondisk_entry_size(create_inode_as_dir_entry_op) + 1);
+ auto fill_write_str = temporary_buffer<uint8_t>(fill_write_op.length);
+ std::memset(fill_write_str.get_write(), 'a', fill_write_op.length);
+ BOOST_REQUIRE_EQUAL(test.buf.append(fill_write_op, fill_write_str.get()), APPENDED);
+ BOOST_REQUIRE_EQUAL(test.buf.append(create_inode_as_dir_entry_op, fill_write_str.get()), TOO_BIG);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_create_inode_as_dir_entry) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ test_data test{};
+ temporary_buffer<uint8_t> create_inode_as_dir_entry_str = tmp_buff_from_string("120345");
+ ondisk_type create_inode_as_dir_entry_type = CREATE_INODE_AS_DIR_ENTRY;
+ ondisk_create_inode_as_dir_entry_header create_inode_as_dir_entry_op {

+ {
+ 42,
+ 1,
+ {5, 2, 6, 8, 4}
+ },
+ 7,

+ static_cast<uint16_t>(create_inode_as_dir_entry_str.size())
+ };
+ BOOST_REQUIRE_EQUAL(test.buf.append(create_inode_as_dir_entry_op, create_inode_as_dir_entry_str.get()), APPENDED);
+ test.buf.flush_to_disk(test.dev).get();
+ disk_offset_t len_aligned = round_up_to_multiple_of_power_of_2(sizeof(ondisk_type) + sizeof(ondisk_create_inode_as_dir_entry_header) + create_inode_as_dir_entry_str.size(), alignment);
+ test.dev.read<uint8_t>(0, test.tmp_buf.get_write(), len_aligned).get();
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get(), &checkpoint_type, sizeof(ondisk_type)), 0);
+ test.tmp_buf.trim_front(sizeof(ondisk_type) + sizeof(ondisk_checkpoint));
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get(), &create_inode_as_dir_entry_type, sizeof(ondisk_type)), 0);
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get() + sizeof(ondisk_type), &create_inode_as_dir_entry_op, sizeof(ondisk_create_inode_as_dir_entry_header)), 0);
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get() + sizeof(ondisk_type) + sizeof(ondisk_create_inode_as_dir_entry_header),
+ create_inode_as_dir_entry_str.get(), create_inode_as_dir_entry_str.size()), 0);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_too_big_delete_dir_entry) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ test_data test{};
+ test.buf.init_from_bootstrapped_cluster(max_siz, alignment, 0, max_siz-alignment);
+ ondisk_delete_dir_entry_header delete_dir_entry_op {
+ 42,
+ static_cast<uint16_t>(9)
+ };
+ auto fill_write_op = fill_header(alignment - necessary_bytes - sizeof(ondisk_type) - sizeof(ondisk_small_write_header) - ondisk_entry_size(delete_dir_entry_op) + 1);
+ auto fill_write_str = temporary_buffer<uint8_t>(fill_write_op.length);
+ std::memset(fill_write_str.get_write(), 'a', fill_write_op.length);
+ BOOST_REQUIRE_EQUAL(test.buf.append(fill_write_op, fill_write_str.get()), APPENDED);
+ BOOST_REQUIRE_EQUAL(test.buf.append(delete_dir_entry_op, fill_write_str.get()), TOO_BIG);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_delete_dir_entry) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ test_data test{};
+ temporary_buffer<uint8_t> delete_dir_entry_str = tmp_buff_from_string("120345");
+ ondisk_type delete_dir_entry_type = DELETE_DIR_ENTRY;
+ ondisk_delete_dir_entry_header delete_dir_entry_op {
+ 42,
+ static_cast<uint16_t>(delete_dir_entry_str.size())
+ };
+ BOOST_REQUIRE_EQUAL(test.buf.append(delete_dir_entry_op, delete_dir_entry_str.get()), APPENDED);
+ test.buf.flush_to_disk(test.dev).get();
+ disk_offset_t len_aligned = round_up_to_multiple_of_power_of_2(sizeof(ondisk_type) + sizeof(ondisk_delete_dir_entry_header) + delete_dir_entry_str.size(), alignment);
+ test.dev.read<uint8_t>(0, test.tmp_buf.get_write(), len_aligned).get();
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get(), &checkpoint_type, sizeof(ondisk_type)), 0);
+ test.tmp_buf.trim_front(sizeof(ondisk_type) + sizeof(ondisk_checkpoint));
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get(), &delete_dir_entry_type, sizeof(ondisk_type)), 0);
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get() + sizeof(ondisk_type), &delete_dir_entry_op, sizeof(ondisk_delete_dir_entry_header)), 0);
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get() + sizeof(ondisk_type) + sizeof(ondisk_delete_dir_entry_header),
+ delete_dir_entry_str.get(), delete_dir_entry_str.size()), 0);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_too_big_delete_inode_and_dir_entry) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ test_data test{};
+ test.buf.init_from_bootstrapped_cluster(max_siz, alignment, 0, max_siz-alignment);
+ ondisk_delete_inode_and_dir_entry_header delete_inode_and_dir_entry_op {
+ 42,
+ 24,
+ static_cast<uint16_t>(11)
+ };
+ auto fill_write_op = fill_header(alignment - necessary_bytes - sizeof(ondisk_type) - sizeof(ondisk_small_write_header) - ondisk_entry_size(delete_inode_and_dir_entry_op) + 1);
+ auto fill_write_str = temporary_buffer<uint8_t>(fill_write_op.length);
+ std::memset(fill_write_str.get_write(), 'a', fill_write_op.length);
+ BOOST_REQUIRE_EQUAL(test.buf.append(fill_write_op, fill_write_str.get()), APPENDED);
+ BOOST_REQUIRE_EQUAL(test.buf.append(delete_inode_and_dir_entry_op, fill_write_str.get()), TOO_BIG);
+}
+
+SEASTAR_THREAD_TEST_CASE(test_delete_inode_and_dir_entry) {

+ BOOST_TEST_MESSAGE("\nTest name: " << get_name());

+ test_data test{};
+ temporary_buffer<uint8_t> delete_inode_and_dir_entry_str = tmp_buff_from_string("120345");
+ ondisk_type delete_inode_and_dir_entry_type = DELETE_INODE_AND_DIR_ENTRY;
+ ondisk_delete_inode_and_dir_entry_header delete_inode_and_dir_entry_op {
+ 42,
+ 24,
+ static_cast<uint16_t>(delete_inode_and_dir_entry_str.size())
+ };
+ BOOST_REQUIRE_EQUAL(test.buf.append(delete_inode_and_dir_entry_op, delete_inode_and_dir_entry_str.get()), APPENDED);
+ test.buf.flush_to_disk(test.dev).get();
+ disk_offset_t len_aligned = round_up_to_multiple_of_power_of_2(sizeof(ondisk_type) + sizeof(ondisk_delete_inode_and_dir_entry_header) + delete_inode_and_dir_entry_str.size(), alignment);
+ test.dev.read<uint8_t>(0, test.tmp_buf.get_write(), len_aligned).get();
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get(), &checkpoint_type, sizeof(ondisk_type)), 0);
+ test.tmp_buf.trim_front(sizeof(ondisk_type) + sizeof(ondisk_checkpoint));
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get(), &delete_inode_and_dir_entry_type, sizeof(ondisk_type)), 0);
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get() + sizeof(ondisk_type), &delete_inode_and_dir_entry_op, sizeof(ondisk_delete_inode_and_dir_entry_header)), 0);
+ BOOST_REQUIRE_EQUAL(std::memcmp(test.tmp_buf.get() + sizeof(ondisk_type) + sizeof(ondisk_delete_inode_and_dir_entry_header),
+ delete_inode_and_dir_entry_str.get(), delete_inode_and_dir_entry_str.size()), 0);

+}
diff --git a/tests/unit/CMakeLists.txt b/tests/unit/CMakeLists.txt
index 727a0b4f..6ca79de5 100644
--- a/tests/unit/CMakeLists.txt
+++ b/tests/unit/CMakeLists.txt
@@ -372,6 +372,10 @@ if (Seastar_EXPERIMENTAL_FS)
seastar_add_test (fs_cluster_allocator
KIND BOOST
SOURCES fs_cluster_allocator_test.cc)

+ seastar_add_test (fs_metadata_to_disk_buffer
+ SOURCES
+ fs_metadata_to_disk_buffer_test.cc
+ fs_mock_block_device.cc)
seastar_add_test (fs_mock_metadata_to_disk_buffer
SOURCES
fs_mock_metadata_to_disk_buffer_test.cc
--
2.26.1
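
The success-path tests in this patch round each entry size up to `alignment` before reading the flushed data back from the device. The helper they call is not shown in this part of the series, so the following is only a minimal sketch of how such a rounding helper could work. The name `round_up_pow2_sketch` and the 4096-byte alignment are assumptions made for illustration, not the patch's actual code.

```cpp
#include <cstdint>

// Hypothetical stand-in for the rounding helper used by the tests.
// Assumption: pow2 is a nonzero power of two (e.g. a 4096-byte alignment).
constexpr uint64_t round_up_pow2_sketch(uint64_t value, uint64_t pow2) {
    // Adding pow2 - 1 and clearing the low bits rounds up to the next multiple.
    return (value + pow2 - 1) & ~(pow2 - 1);
}

// Assumed alignment for the examples below.
constexpr uint64_t sketch_alignment = 4096;

static_assert(round_up_pow2_sketch(0, sketch_alignment) == 0);
static_assert(round_up_pow2_sketch(1, sketch_alignment) == 4096);
static_assert(round_up_pow2_sketch(4096, sketch_alignment) == 4096);
static_assert(round_up_pow2_sketch(4097, sketch_alignment) == 8192);
```

Under these assumptions, an entry of a few dozen bytes still results in a read of one full aligned block, which is why the tests can compare the buffer contents starting at offset 0.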

Alexander Gallego

<alex@vectorized.io>

Apr 22, 2020, 12:52:58 AM

to seastar-dev

On Monday, April 20, 2020 at 5:02:18 AM UTC-7, Krzysztof Małysa wrote:

github: https://github.com/psarna/seastar/commits/fs-metadata-log

This series is part of the ZPP FS project that is coordinated by Piotr Sarna <sa...@scylladb.com>.
The goal of this project is to create SeastarFS -- a fully asynchronous, sharded, user-space,
log-structured file system that is intended to become an alternative to XFS for Scylla.

The filesystem is optimized for:
- NVMe SSD storage
- large files
- appending files

Hi Krzysztof! Very cool change. I'll have to dig through this in more detail. While you are still
designing and gathering feedback, I wanted to make the case for exposing the block device interface
as a separate public interface (the filesystem would depend on it).

The reason is that at vectorized (vectorized.io/redpanda) we basically have just large files that
we append to, and because of protocol compatibility with Kafka, we already have everything we can think of
to provide error detection (header checksum and per-append/blob checksum), size, timestamps, etc.
Ideally, I'd love to play with the block allocator ourselves.

Roughly speaking, we have co-designed the writer with the reader, i.e. we rely on a header checksum
at the record level, simply fallocate the files upfront (actually adaptively), and keep other higher-level
metadata about what should be in the file (the raft log tells us the offset that is visible publicly).
Under poor network conditions, Raft will actually issue a lot of truncation operations to the underlying file,
but by and large, it would fit your ideal use-case as well.

Do you have a higher-level design document? (I'm cool reading the code if not.) I'd like to understand exactly
how to manipulate the proposed 'big data log' section below.

For efficiency all metadata is stored in the memory. Metadata holds information about where each
part of the file is located and about the directory tree.

Whole filesystem is divided into filesystem shards (typically the same as number of seastar shards
for efficiency). Each shard holds its own part of the filesystem. Sharding is done by the fact that
each shard has its set of root subdirectories that it exclusively owns (one of the shards is an
owner of the root directory itself).

Every shard will have 3 private logs:

- metadata log -- holds metadata and very small writes

Can you define small writes - is it 100 bytes or 4KB?

- medium data log -- holds medium-sized writes, which can combine data from different files

64KB/128KB? What are the ranges for these sizes?

- big data log -- holds data clusters, each of which belongs to a single file (this is not
actually a log, but in the big picture it looks like it was)

Disk space is divided into clusters (typically around several MiB) that
have all equal size that is multiple of alignment (typically 4096
bytes). Each shard has its private pool of clusters (assignment is
stored in bootstrap record). Each log consumes clusters one by one -- it
writes the current one and if cluster becomes full, then log switches to
a new one that is obtained from a pool of free clusters managed by
cluster_allocator. Metadata log and medium data log write data in the
same manner: they fill up the cluster gradually from left to right. Big
data log takes a cluster and completely fills it with data at once -- it
is only used during big writes.
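
The cluster handling described in the paragraph above can be sketched roughly as follows. This is only an illustration under stated assumptions: `free_cluster_pool` and `cluster_log` are hypothetical names, and the series' real `cluster_allocator` and log writers may look quite different. The sketch shows a log filling its current cluster from left to right and switching to a fresh cluster from the pool once the next entry would not fit.

```cpp
#include <cstdint>
#include <deque>
#include <optional>
#include <utility>

using cluster_id = uint64_t;

// Hypothetical pool of free clusters owned by a single shard.
class free_cluster_pool {
    std::deque<cluster_id> _free;
public:
    explicit free_cluster_pool(std::deque<cluster_id> ids) : _free(std::move(ids)) {}

    // Hands out one free cluster, or nullopt when the pool is exhausted.
    std::optional<cluster_id> alloc() {
        if (_free.empty()) {
            return std::nullopt;
        }
        cluster_id id = _free.front();
        _free.pop_front();
        return id;
    }

    // Returns a cluster to the pool.
    void free(cluster_id id) { _free.push_back(id); }
};

// Hypothetical log that consumes clusters one by one.
class cluster_log {
public:
    struct position {
        cluster_id cluster;
        uint64_t offset;
    };

    cluster_log(free_cluster_pool& pool, uint64_t cluster_size)
        : _pool(pool), _cluster_size(cluster_size) {}

    // Reserves len bytes and reports where they would be written.
    // Returns nullopt if len exceeds a cluster or no free cluster is left.
    std::optional<position> append(uint64_t len) {
        if (len > _cluster_size) {
            return std::nullopt;
        }
        if (!_current || _offset + len > _cluster_size) {
            auto next = _pool.alloc();
            if (!next) {
                return std::nullopt;
            }
            _current = next;
            _offset = 0;
        }
        position pos{*_current, _offset};
        _offset += len;
        return pos;
    }

private:
    free_cluster_pool& _pool;
    uint64_t _cluster_size;
    std::optional<cluster_id> _current;
    uint64_t _offset = 0;
};
```

In this sketch a big write would correspond to a single `append` of exactly `cluster_size` bytes, which always lands at offset 0 of a freshly allocated cluster.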

Any insight as to how this will integrate w/ the io_priority classes and subsystem? Any major behavior changes there?

metadata_log is in fact a standalone file system instance that provides lower level interface
(paths and inodes) of shard's own part of the filesystem. It manages all 3 logs mentioned above and
maintains all metadata about its part of the file system that include data structures for directory
structure and file content, locking logic for safe concurrent usage, buffers for writing logs to
disk, and bootstrapping -- restoring file system structure from disk.

This patch implements:
- bootstrap record -- our equivalent of the filesystem superblock -- it contains information like
size of the block device, number of filesystem shards, cluster distribution among shards
- cluster allocator for managing free clusters within one metadata_log
- fully functional metadata_log that will be one shard's part of the filesystem
- bootstrapping metadata_log
- creating / deleting file and directories (+ support for unlinked files)
- reading, writing and truncating files
- opening and closing files
- linking files (but not directories)
- iterating directory and getting file attributes
- tests of some components and functionality of the metadata_log and bootstrap record

What is not here, but we plan on pushing it later:
- compaction
- filesystem sharding
- renaming files

Tests: unit(dev)

Aleksander Sorokin (3):
fs: add initial file implementation
tests: fs: add parallel i/o unit test for seastarfs file
tests: fs: add basic test for metadata log bootstrapping

Krzysztof Małysa (14):
fs: add initial block_device implementation
fs: add temporary_file
tests: fs: add block_device unit test
fs: add unit headers
fs: add seastar/fs/overloaded.hh
fs: add seastar/fs/path.hh with unit tests
fs: add value_shared_lock.hh
fs: metadata_log: add base implementation
fs: metadata_log: add operation for creating and opening unlinked file
fs: metadata_log: add creating files and directories
fs: metadata_log: add private operation for deleting inode
fs: metadata_log: add link operation
fs: metadata_log: add unlinking files and removing directories
fs: metadata_log: add stat() operation

Michał Niciejewski (10):
fs: add bootstrap record implementation
tests: fs: add tests for bootstrap record
fs: metadata_log: add opening file
fs: metadata_log: add closing file
fs: metadata_log: add write operation
fs: metadata_log: add read operation
tests: fs: added metadata_to_disk_buffer and cluster_writer mockers
tests: fs: add write test
fs: read: add optimization for aligned reads
tests: fs: add tests for aligned reads and writes

Piotr Sarna (1):
fs: prepare fs/ directory and conditional compilation

Wojciech Mitros (6):
fs: add cluster allocator
fs: add cluster allocator tests
fs: metadata_log: add truncate operation
tests: fs: add to_disk_buffer test
tests: fs: add truncate operation test
tests: fs: add metadata_to_disk_buffer unit tests

configure.py | 6 +
include/seastar/fs/block_device.hh | 102 +++
include/seastar/fs/exceptions.hh | 88 ++
include/seastar/fs/file.hh | 55 ++
include/seastar/fs/overloaded.hh | 26 +
include/seastar/fs/stat.hh | 41 +
include/seastar/fs/temporary_file.hh | 54 ++
src/fs/bitwise.hh | 125 +++
src/fs/bootstrap_record.hh | 98 ++
src/fs/cluster.hh | 42 +
src/fs/cluster_allocator.hh | 50 ++
src/fs/cluster_writer.hh | 85 ++
src/fs/crc.hh | 34 +
src/fs/device_reader.hh | 91 ++
src/fs/inode.hh | 80 ++
src/fs/inode_info.hh | 221 +++++
src/fs/metadata_disk_entries.hh | 208 +++++
src/fs/metadata_log.hh | 362 ++++++++
src/fs/metadata_log_bootstrap.hh | 145 +++
.../create_and_open_unlinked_file.hh | 77 ++
src/fs/metadata_log_operations/create_file.hh | 174 ++++
src/fs/metadata_log_operations/link_file.hh | 112 +++
src/fs/metadata_log_operations/read.hh | 138 +++
src/fs/metadata_log_operations/truncate.hh | 90 ++
.../unlink_or_remove_file.hh | 196 ++++
src/fs/metadata_log_operations/write.hh | 318 +++++++
src/fs/metadata_to_disk_buffer.hh | 244 +++++
src/fs/path.hh | 42 +
src/fs/range.hh | 61 ++
src/fs/to_disk_buffer.hh | 138 +++
src/fs/units.hh | 40 +
src/fs/unix_metadata.hh | 40 +
src/fs/value_shared_lock.hh | 65 ++
tests/unit/fs_metadata_common.hh | 467 ++++++++++
tests/unit/fs_mock_block_device.hh | 55 ++
tests/unit/fs_mock_cluster_writer.hh | 78 ++
tests/unit/fs_mock_metadata_to_disk_buffer.hh | 323 +++++++
src/fs/bootstrap_record.cc | 206 +++++
src/fs/cluster_allocator.cc | 54 ++
src/fs/device_reader.cc | 199 +++++
src/fs/file.cc | 108 +++
src/fs/metadata_log.cc | 525 +++++++++++
src/fs/metadata_log_bootstrap.cc | 552 ++++++++++++
tests/unit/fs_block_device_test.cc | 206 +++++
tests/unit/fs_bootstrap_record_test.cc | 414 +++++++++
tests/unit/fs_cluster_allocator_test.cc | 115 +++
tests/unit/fs_log_bootstrap_test.cc | 86 ++
tests/unit/fs_metadata_to_disk_buffer_test.cc | 462 ++++++++++
tests/unit/fs_mock_block_device.cc | 50 ++
.../fs_mock_metadata_to_disk_buffer_test.cc | 357 ++++++++
tests/unit/fs_path_test.cc | 90 ++
tests/unit/fs_seastarfs_test.cc | 62 ++
tests/unit/fs_to_disk_buffer_test.cc | 160 ++++
tests/unit/fs_truncate_test.cc | 171 ++++
tests/unit/fs_write_test.cc | 835 ++++++++++++++++++
CMakeLists.txt | 50 ++
src/fs/README.md | 10 +
tests/unit/CMakeLists.txt | 42 +
58 files changed, 9325 insertions(+)
create mode 100644 include/seastar/fs/block_device.hh
create mode 100644 include/seastar/fs/exceptions.hh
create mode 100644 include/seastar/fs/file.hh
create mode 100644 include/seastar/fs/overloaded.hh
create mode 100644 include/seastar/fs/stat.hh
create mode 100644 include/seastar/fs/temporary_file.hh
create mode 100644 src/fs/bitwise.hh
create mode 100644 src/fs/bootstrap_record.hh
create mode 100644 src/fs/cluster.hh
create mode 100644 src/fs/cluster_allocator.hh
create mode 100644 src/fs/cluster_writer.hh
create mode 100644 src/fs/crc.hh
create mode 100644 src/fs/device_reader.hh
create mode 100644 src/fs/inode.hh

create mode 100644 src/fs/inode_info.hh
create mode 100644 src/fs/metadata_disk_entries.hh
create mode 100644 src/fs/metadata_log.hh
create mode 100644 src/fs/metadata_log_bootstrap.hh

create mode 100644 src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
create mode 100644 src/fs/metadata_log_operations/create_file.hh
create mode 100644 src/fs/metadata_log_operations/link_file.hh
create mode 100644 src/fs/metadata_log_operations/read.hh
create mode 100644 src/fs/metadata_log_operations/truncate.hh
create mode 100644 src/fs/metadata_log_operations/unlink_or_remove_file.hh
create mode 100644 src/fs/metadata_log_operations/write.hh
create mode 100644 src/fs/metadata_to_disk_buffer.hh
create mode 100644 src/fs/path.hh
create mode 100644 src/fs/range.hh
create mode 100644 src/fs/to_disk_buffer.hh
create mode 100644 src/fs/units.hh
create mode 100644 src/fs/unix_metadata.hh
create mode 100644 src/fs/value_shared_lock.hh
create mode 100644 tests/unit/fs_metadata_common.hh
create mode 100644 tests/unit/fs_mock_block_device.hh

create mode 100644 tests/unit/fs_mock_cluster_writer.hh
create mode 100644 tests/unit/fs_mock_metadata_to_disk_buffer.hh

create mode 100644 src/fs/bootstrap_record.cc
create mode 100644 src/fs/cluster_allocator.cc
create mode 100644 src/fs/device_reader.cc
create mode 100644 src/fs/file.cc

create mode 100644 src/fs/metadata_log.cc
create mode 100644 src/fs/metadata_log_bootstrap.cc

create mode 100644 tests/unit/fs_block_device_test.cc
create mode 100644 tests/unit/fs_bootstrap_record_test.cc
create mode 100644 tests/unit/fs_cluster_allocator_test.cc
create mode 100644 tests/unit/fs_log_bootstrap_test.cc
create mode 100644 tests/unit/fs_metadata_to_disk_buffer_test.cc
create mode 100644 tests/unit/fs_mock_block_device.cc
create mode 100644 tests/unit/fs_mock_metadata_to_disk_buffer_test.cc
create mode 100644 tests/unit/fs_path_test.cc
create mode 100644 tests/unit/fs_seastarfs_test.cc
create mode 100644 tests/unit/fs_to_disk_buffer_test.cc
create mode 100644 tests/unit/fs_truncate_test.cc
create mode 100644 tests/unit/fs_write_test.cc
create mode 100644 src/fs/README.md

--
2.26.1

Avi Kivity

<avi@scylladb.com>

Apr 22, 2020, 5:20:33 AM

to Krzysztof Małysa, seastar-dev@googlegroups.com, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

On 4/20/20 3:01 PM, Krzysztof Małysa wrote:
> github: https://github.com/psarna/seastar/commits/fs-metadata-log
>
> This series is part of the ZPP FS project that is coordinated by Piotr Sarna <sa...@scylladb.com>.
> The goal of this project is to create SeastarFS -- a fully asynchronous, sharded, user-space,
> log-structured file system that is intended to become an alternative to XFS for Scylla.

Not just Scylla, any Seastar application.

> The filesystem is optimized for:
> - NVMe SSD storage
> - large files
> - appending files
>

> For efficiency all metadata is stored in the memory. Metadata holds information about where each
> part of the file is located and about the directory tree.
>
> Whole filesystem is divided into filesystem shards (typically the same as number of seastar shards
> for efficiency). Each shard holds its own part of the filesystem. Sharding is done by the fact that
> each shard has its set of root subdirectories that it exclusively owns (one of the shards is an
> owner of the root directory itself).
>
> Every shard will have 3 private logs:
> - metadata log -- holds metadata and very small writes

> - medium data log -- holds medium-sized writes, which can combine data from different files

> - big data log -- holds data clusters, each of which belongs to a single file (this is not
> actually a log, but in the big picture it looks like it was)

> Disk space is divided into clusters (typically around several MiB) that
> have all equal size that is multiple of alignment (typically 4096
> bytes). Each shard has its private pool of clusters (assignment is
> stored in bootstrap record). Each log consumes clusters one by one -- it
> writes the current one and if cluster becomes full, then log switches to
> a new one that is obtained from a pool of free clusters managed by
> cluster_allocator. Metadata log and medium data log write data in the
> same manner: they fill up the cluster gradually from left to right. Big
> data log takes a cluster and completely fills it with data at once -- it
> is only used during big writes.
>

> metadata_log is in fact a standalone file system instance that provides lower level interface
> (paths and inodes) of shard's own part of the filesystem. It manages all 3 logs mentioned above and
> maintains all metadata about its part of the file system that include data structures for directory
> structure and file content, locking logic for safe concurrent usage, buffers for writing logs to
> disk, and bootstrapping -- restoring file system structure from disk.
>
> This patch implements:
> - bootstrap record -- our equivalent of the filesystem superblock -- it contains information like
> size of the block device, number of filesystem shards, cluster distribution among shards
> - cluster allocator for managing free clusters within one metadata_log
> - fully functional metadata_log that will be one shard's part of the filesystem
> - bootstrapping metadata_log
> - creating / deleting file and directories (+ support for unlinked files)
> - reading, writing and truncating files
> - opening and closing files
> - linking files (but not directories)
> - iterating directory and getting file attributes
> - tests of some components and functionality of the metadata_log and bootstrap record
>
> What is not here, but we plan on pushing it later:
> - compaction
> - filesystem sharding
> - renaming files

I'd like to see in-tree documentation, especially of the on-disk format,
but also of the algorithms and in-memory data structures.

> Tests: unit(dev)

Avi Kivity

<avi@scylladb.com>

Apr 22, 2020, 5:22:52 AM4/22/20

to Krzysztof Małysa, seastar-dev@googlegroups.com, Piotr Sarna, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

On 4/20/20 3:01 PM, Krzysztof Małysa wrote:

> From: Piotr Sarna <sa...@scylladb.com>
>
> This patch provides the initial infrastructure for future
> SeastarFS (Seastar filesystem) patches.
> Since the project is in very early stage and is going to require
> C++17 features, it's disabled by default and can only be enabled
> manually by configuring with --enable-experimental-fs
> or defining a CMake flag -DSeastar_EXPERIMENTAL_FS=ON.

It is fine for new features to depend on C++17 (and even C++20 now),
provided the feature is disabled when an older dialect is used. This
means you may use coroutines in the implementation.
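
One way to honour the "disabled when an older dialect is used" requirement at the source level is a guard on `__cplusplus`. This is only a sketch: the macro `EXAMPLE_FS_DIALECT_OK`, the namespace `example::fs` and the function `lookup` are hypothetical names, and the patch itself gates the feature in the build system rather than in a header.

```cpp
#include <cassert>

// Hypothetical macro name, for illustration only. The patch gates the feature
// in the build system (--enable-experimental-fs / Seastar_EXPERIMENTAL_FS).
#if __cplusplus >= 201703L
#define EXAMPLE_FS_DIALECT_OK 1
#else
#define EXAMPLE_FS_DIALECT_OK 0
#endif

#if EXAMPLE_FS_DIALECT_OK
#include <optional>

// Nested namespace definitions and std::optional both need C++17, so this
// part is only compiled when the dialect supports them.
namespace example::fs {
inline std::optional<unsigned> lookup(bool present) {
    if (present) {
        return 7u;
    }
    return std::nullopt;
}
} // namespace example::fs
#endif

// Callers can query availability without touching C++17-only declarations.
constexpr bool example_fs_available() noexcept {
    return EXAMPLE_FS_DIALECT_OK != 0;
}
```

With an older dialect the translation unit still compiles; only the C++17-dependent declarations disappear.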

Avi Kivity

<avi@scylladb.com>

Apr 22, 2020, 5:27:00 AM4/22/20

to Krzysztof Małysa, seastar-dev@googlegroups.com, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

On 4/20/20 3:01 PM, Krzysztof Małysa wrote:

> block_device is an abstraction over an opened block device file or
> opened ordinary file of fixed size. It offers:
> - opening and closing file (block device)
> - aligned reads and writes
> - flushing

>
> Signed-off-by: Krzysztof Małysa <var...@gmail.com>
> ---

> include/seastar/fs/block_device.hh | 102 +++++++++++++++++++++++++++++
> CMakeLists.txt | 1 +
> 2 files changed, 103 insertions(+)
> create mode 100644 include/seastar/fs/block_device.hh
>
> diff --git a/include/seastar/fs/block_device.hh b/include/seastar/fs/block_device.hh
> new file mode 100644
> index 00000000..31822617
> --- /dev/null
> +++ b/include/seastar/fs/block_device.hh
> @@ -0,0 +1,102 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*

> + * Copyright (C) 2019 ScyllaDB

I don't believe we have a copyright assignment in place, so you should
retain the copyright here.

> + */
> +
> +#pragma once
> +

> +#include "seastar/core/file.hh"
> +#include "seastar/core/reactor.hh"

> +
> +namespace seastar::fs {
> +

> +class block_device_impl {
> +public:
> + virtual ~block_device_impl() = default;
> +
> + virtual future<size_t> write(uint64_t aligned_pos, const void* aligned_buffer, size_t aligned_len, const io_priority_class& pc) = 0;
> + virtual future<size_t> read(uint64_t aligned_pos, void* aligned_buffer, size_t aligned_len, const io_priority_class& pc) = 0;
> + virtual future<> flush() = 0;
> + virtual future<> close() = 0;
> +};
> +
> +class block_device {
> + shared_ptr<block_device_impl> _block_device_impl;
> +public:

What's wrong with using seastar::file? We already support opening a
block device as a file. And in fact I see below that you implement it on
top of seastar::file.

If you plan an additional implementation, you should explain it in the
changelog.

> + block_device(shared_ptr<block_device_impl> impl) noexcept : _block_device_impl(std::move(impl)) {}
> +
> + block_device() = default;
> +
> + block_device(const block_device&) = default;
> + block_device(block_device&&) noexcept = default;
> + block_device& operator=(const block_device&) noexcept = default;
> + block_device& operator=(block_device&&) noexcept = default;
> +
> + explicit operator bool() const noexcept { return bool(_block_device_impl); }
> +
> + template <typename CharType>
> + future<size_t> read(uint64_t aligned_offset, CharType* aligned_buffer, size_t aligned_len, const io_priority_class& pc = default_priority_class()) {
> + return _block_device_impl->read(aligned_offset, aligned_buffer, aligned_len, pc);
> + }
> +
> + template <typename CharType>
> + future<size_t> write(uint64_t aligned_offset, const CharType* aligned_buffer, size_t aligned_len, const io_priority_class& pc = default_priority_class()) {
> + return _block_device_impl->write(aligned_offset, aligned_buffer, aligned_len, pc);
> + }
> +
> + future<> flush() {
> + return _block_device_impl->flush();
> + }
> +
> + future<> close() {
> + return _block_device_impl->close();
> + }
> +};
> +
> +class file_block_device_impl : public block_device_impl {
> + file _file;
> +public:
> + explicit file_block_device_impl(file f) : _file(std::move(f)) {}
> +
> + ~file_block_device_impl() override = default;
> +
> + future<size_t> write(uint64_t aligned_pos, const void* aligned_buffer, size_t aligned_len, const io_priority_class& pc) override {
> + return _file.dma_write(aligned_pos, aligned_buffer, aligned_len, pc);
> + }
> +
> + future<size_t> read(uint64_t aligned_pos, void* aligned_buffer, size_t aligned_len, const io_priority_class& pc) override {
> + return _file.dma_read(aligned_pos, aligned_buffer, aligned_len, pc);
> + }
> +
> + future<> flush() override {
> + return _file.flush();
> + }
> +
> + future<> close() override {
> + return _file.close();
> + }
> +};
> +
> +inline future<block_device> open_block_device(std::string name) {
> + return open_file_dma(std::move(name), open_flags::rw).then([](file f) {
> + return block_device(make_shared<file_block_device_impl>(std::move(f)));

> + });
> +}
> +
> +}

> diff --git a/CMakeLists.txt b/CMakeLists.txt
> index be4f02c8..b50abf99 100644
> --- a/CMakeLists.txt
> +++ b/CMakeLists.txt
> @@ -657,6 +657,7 @@ if (Seastar_EXPERIMENTAL_FS)
> target_sources(seastar

> PRIVATE
> # SeastarFS source files

> + include/seastar/fs/block_device.hh
> )
> endif()
>

Avi Kivity

<avi@scylladb.com>

Apr 22, 2020, 5:32:21 AM4/22/20

to Krzysztof Małysa, seastar-dev@googlegroups.com, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

On 4/20/20 3:01 PM, Krzysztof Małysa wrote:

> temporary_file is a handle to a temporary file with a path.
> It creates temporary file upon construction and removes it upon
> destruction.
> The main use case is testing the file system on a temporary file that
> simulates a block device.

>
> Signed-off-by: Krzysztof Małysa <var...@gmail.com>
> ---

> include/seastar/fs/temporary_file.hh | 54 ++++++++++++++++++++++++++++
> CMakeLists.txt | 1 +
> 2 files changed, 55 insertions(+)
> create mode 100644 include/seastar/fs/temporary_file.hh
>
> diff --git a/include/seastar/fs/temporary_file.hh b/include/seastar/fs/temporary_file.hh
> new file mode 100644
> index 00000000..c00282d9
> --- /dev/null
> +++ b/include/seastar/fs/temporary_file.hh
> @@ -0,0 +1,54 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*

> + */
> +
> +#pragma once
> +

> +#include "seastar/core/posix.hh"
> +
> +#include <string>

> +
> +namespace seastar::fs {
> +

> +class temporary_file {
> + std::string _path;

This is infrastructure and not intended to be used by users of the
filesystem, yes? So it shouldn't be in include/seastar which is a public
header. In fact all fs headers should be private, since the main
interface is still seastar::file.

Note we already have a tmp_file class (temporary_file is a better name
though).

> +
> +public:
> + explicit temporary_file(std::string path) : _path(std::move(path) + ".XXXXXX") {
> + int fd = mkstemp(_path.data());
> + throw_system_error_on(fd == -1);
> + close(fd);
> + }
> +
> + ~temporary_file() {
> + unlink(_path.data());
> + }
> +
> + temporary_file(const temporary_file&) = delete;
> + temporary_file& operator=(const temporary_file&) = delete;
> + temporary_file(temporary_file&&) noexcept = delete;
> + temporary_file& operator=(temporary_file&&) noexcept = delete;
> +
> + const std::string& path() const noexcept {
> + return _path;
> + }
> +};
> +
> +} // namespace seastar::fs
> diff --git a/CMakeLists.txt b/CMakeLists.txt
> index 0ba7ee35..39d11ad8 100644
> --- a/CMakeLists.txt
> +++ b/CMakeLists.txt
> @@ -659,6 +659,7 @@ if (Seastar_EXPERIMENTAL_FS)

> # SeastarFS source files
> include/seastar/fs/block_device.hh

> include/seastar/fs/file.hh
> + include/seastar/fs/temporary_file.hh
> src/fs/file.cc
> )
> endif()

Avi Kivity

<avi@scylladb.com>

Apr 22, 2020, 5:36:59 AM4/22/20

to Krzysztof Małysa, seastar-dev@googlegroups.com, Aleksander Sorokin, sarna@scylladb.com, quport@gmail.com, wmitros@protonmail.com

On 4/20/20 3:01 PM, Krzysztof Małysa wrote:

> From: Aleksander Sorokin <ank...@gmail.com>
>
> Added first crude unit test for seastarfs_file_impl:
> parallel writing with newly created handle and reading the same data back.

>
> Signed-off-by: Aleksander Sorokin <ank...@gmail.com>
> ---

> tests/unit/fs_seastarfs_test.cc | 62 +++++++++++++++++++++++++++++++++
> tests/unit/CMakeLists.txt | 5 +++
> 2 files changed, 67 insertions(+)
> create mode 100644 tests/unit/fs_seastarfs_test.cc
>
> diff --git a/tests/unit/fs_seastarfs_test.cc b/tests/unit/fs_seastarfs_test.cc
> new file mode 100644
> index 00000000..25c3e8d5
> --- /dev/null
> +++ b/tests/unit/fs_seastarfs_test.cc
> @@ -0,0 +1,62 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*

> + * Copyright (C) 2019 ScyllaDB
> + */
> +
> +#include "seastar/core/aligned_buffer.hh"
> +#include "seastar/core/file-types.hh"
> +#include "seastar/core/file.hh"
> +#include "seastar/core/thread.hh"
> +#include "seastar/core/units.hh"
> +#include "seastar/fs/file.hh"
> +#include "seastar/fs/temporary_file.hh"

> +#include "seastar/testing/thread_test_case.hh"
> +
> +using namespace seastar;

> +using namespace fs;
> +
> +constexpr auto device_path = "/tmp/seastarfs";
> +constexpr auto device_size = 16 * MB;
> +
> +SEASTAR_THREAD_TEST_CASE(parallel_read_write_test) {
> + const auto tf = temporary_file(device_path);
> + auto f = fs::open_file_dma(tf.path(), open_flags::rw).get0();
> + static auto alignment = f.memory_dma_alignment();
> +
> + parallel_for_each(boost::irange<off_t>(0, device_size / alignment), [&f](auto i) {

Please avoid generic lambdas. The problem is that the entire code below
becomes a template, so IDEs that do static analysis become useless. For
example, the IDE cannot do any checks on the call to std::fill() below,
since it has no idea what the type of i is. This gets compounded as you
nest deeper.
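
A minimal sketch of the same loop shape with the lambda parameter type spelled out. It uses only the standard library, not Seastar; `fill_blocks` and its parameters are hypothetical stand-ins for the test body, and `std::uint64_t` is used where the test uses `off_t`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the test body: fills block i of `device` with the
// byte value i (truncated to 8 bits).
inline void fill_blocks(std::vector<unsigned char>& device, std::size_t block_size) {
    assert(block_size > 0);
    std::vector<std::uint64_t> indices(device.size() / block_size);
    for (std::size_t n = 0; n < indices.size(); ++n) {
        indices[n] = n;
    }
    // The parameter type is explicit, so the lambda body is ordinary code
    // rather than a template: the std::fill call can be checked as written.
    std::for_each(indices.begin(), indices.end(), [&device, block_size](std::uint64_t i) {
        auto first = device.begin() + static_cast<std::ptrdiff_t>(i * block_size);
        std::fill(first, first + static_cast<std::ptrdiff_t>(block_size),
                  static_cast<unsigned char>(i));
    });
}
```

The explicit `static_cast<unsigned char>(i)` also documents the narrowing that a generic lambda would hide.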

> + auto wbuf = allocate_aligned_buffer<unsigned char>(alignment, alignment);
> + std::fill(wbuf.get(), wbuf.get() + alignment, i);
> + auto wb = wbuf.get();
> +
> + return f.dma_write(i * alignment, wb, alignment).then(
> + [&f, i, wbuf = std::move(wbuf)](auto ret) mutable {
> + BOOST_REQUIRE_EQUAL(ret, alignment);
> + auto rbuf = allocate_aligned_buffer<unsigned char>(alignment, alignment);
> + auto rb = rbuf.get();
> + return f.dma_read(i * alignment, rb, alignment).then(
> + [f, rbuf = std::move(rbuf), wbuf = std::move(wbuf)](auto ret) {
> + BOOST_REQUIRE_EQUAL(ret, alignment);
> + BOOST_REQUIRE(std::equal(rbuf.get(), rbuf.get() + alignment, wbuf.get()));
> + });
> + });
> + }).wait();
> +
> + f.flush().wait();
> + f.close().wait();
> +

You should use get() instead of wait() in threads, in such situations.
wait() waits for the future to become ready, but does not check if it
failed. This means failures can go unnoticed.
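
To illustrate the difference, here is a heavily simplified model, not Seastar code: `toy_future`, `failing_flush` and the helper functions are hypothetical. It only models the behaviour described above, where wait() returns once the result is available and get() also rethrows a stored failure. std::future in the standard library draws the same distinction between wait() and get().

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <utility>

// Hypothetical, heavily simplified model of an already-resolved future<>.
class toy_future {
    std::exception_ptr _error; // null when the operation succeeded
public:
    explicit toy_future(std::exception_ptr error = nullptr) : _error(std::move(error)) {}

    // Models wait(): returns once the result is available, without inspecting it.
    void wait() const noexcept {}

    // Models get(): rethrows the stored failure, if any.
    void get() const {
        if (_error) {
            std::rethrow_exception(_error);
        }
    }
};

// Stand-in for a flush() that failed.
inline toy_future failing_flush() {
    return toy_future(std::make_exception_ptr(std::runtime_error("flush failed")));
}

// Returns true if the caller observed the failure.
inline bool failure_seen_with_wait() {
    try {
        failing_flush().wait();
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}

inline bool failure_seen_with_get() {
    try {
        failing_flush().get();
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}

// Usage: only get() surfaces the failed flush.
inline void demo_wait_vs_get() {
    assert(!failure_seen_with_wait());
    assert(failure_seen_with_get());
}
```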

> }
> diff --git a/tests/unit/CMakeLists.txt b/tests/unit/CMakeLists.txt
> index 8f203721..f2c5187f 100644
> --- a/tests/unit/CMakeLists.txt
> +++ b/tests/unit/CMakeLists.txt
> @@ -361,6 +361,11 @@ seastar_add_test (rpc
> loopback_socket.hh
> rpc_test.cc)
>
> +if (Seastar_EXPERIMENTAL_FS)
> + seastar_add_test (fs_seastarfs
> + SOURCES fs_seastarfs_test.cc)
> +endif()
> +
> seastar_add_test (semaphore
> SOURCES semaphore_test.cc)
>

Piotr Sarna

<sarna@scylladb.com>

Apr 22, 2020, 5:39:53 AM4/22/20

to Avi Kivity, Krzysztof Małysa, seastar-dev, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

Indeed, the plan is to provide more optimal implementations later (a direct io_uring-based one looks promising), and use this temporary one for now for convenience. A TODO can also be added here, as well as a short note about it in the changelog.

Avi Kivity

<avi@scylladb.com>

Apr 22, 2020, 5:44:44 AM4/22/20

to Krzysztof Małysa, seastar-dev@googlegroups.com, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

On 4/20/20 3:01 PM, Krzysztof Małysa wrote:

> What is tested:
> - simple reads and writes
> - parallel non-overlapping writes then parallel non-overlapping reads
> - random and simultaneous reads and writes
>

Some suggestions for more testing:

- port fsx and run it (I found some version in
https://github.com/kdave/xfstests/blob/master/ltp/fsx.c, not sure it is
authoritative). It is also GPL, so it can't be directly added to this
repository. I had great success with it in the past.

- write tests that generate random reads and writes to two files, one
in the new filesystem, one from the host, and compare the results of
reads. The test just has to serialize overlapping operations, but
otherwise can generate as much concurrency as it likes.
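
The second suggestion can be sketched as a differential test. Everything below is hypothetical and uses only the standard library: `flat_file` plays the role of the host file, `chunked_file` plays the role of the filesystem under test, and operations run one at a time, so the concurrency and the serialization of overlapping operations are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <random>
#include <vector>

using bytes = std::vector<std::uint8_t>;

// Reference model: a flat byte array, standing in for the host file.
class flat_file {
    bytes _data;
public:
    void write(std::size_t pos, const bytes& src) {
        if (_data.size() < pos + src.size()) {
            _data.resize(pos + src.size(), 0);
        }
        std::copy(src.begin(), src.end(), _data.begin() + static_cast<std::ptrdiff_t>(pos));
    }
    bytes read(std::size_t pos, std::size_t len) const {
        bytes out(len, 0); // bytes past the end read as zero
        for (std::size_t i = 0; i < len && pos + i < _data.size(); ++i) {
            out[i] = _data[pos + i];
        }
        return out;
    }
};

constexpr std::size_t chunk_size = 64;

// Hypothetical implementation under test: data lives in fixed-size chunks.
class chunked_file {
    std::map<std::size_t, bytes> _chunks;
public:
    void write(std::size_t pos, const bytes& src) {
        for (std::size_t i = 0; i < src.size(); ++i) {
            bytes& chunk = _chunks[(pos + i) / chunk_size];
            chunk.resize(chunk_size, 0);
            chunk[(pos + i) % chunk_size] = src[i];
        }
    }
    bytes read(std::size_t pos, std::size_t len) const {
        bytes out(len, 0); // unwritten chunks read as zero
        for (std::size_t i = 0; i < len; ++i) {
            auto it = _chunks.find((pos + i) / chunk_size);
            if (it != _chunks.end()) {
                assert(it->second.size() == chunk_size);
                out[i] = it->second[(pos + i) % chunk_size];
            }
        }
        return out;
    }
};

// Applies the same random operations to both files and compares every read.
// Returns the number of mismatching reads (0 means the two always agreed).
inline std::size_t run_differential_test(std::uint32_t seed, std::size_t operations) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pos_dist(0, 1000);
    std::uniform_int_distribution<std::size_t> len_dist(1, 200);
    std::uniform_int_distribution<int> byte_dist(0, 255);
    std::bernoulli_distribution is_write(0.5);
    flat_file reference;
    chunked_file under_test;
    std::size_t mismatches = 0;
    for (std::size_t op = 0; op < operations; ++op) {
        const std::size_t pos = pos_dist(rng);
        const std::size_t len = len_dist(rng);
        if (is_write(rng)) {
            bytes data(len);
            for (std::uint8_t& b : data) {
                b = static_cast<std::uint8_t>(byte_dist(rng));
            }
            reference.write(pos, data);
            under_test.write(pos, data);
        } else if (reference.read(pos, len) != under_test.read(pos, len)) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

Seeding the generator explicitly keeps a failing run reproducible, which matters once the operations are issued concurrently.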

Avi Kivity

<avi@scylladb.com>

Apr 22, 2020, 5:47:08 AM4/22/20

to Piotr Sarna, Krzysztof Małysa, seastar-dev, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

io_uring should be done by the reactor, not each block device.

> and use this temporary one for now for convenience. A TODO can also be added here, as well as a short note about it in the changelog.

Still, why not use the existing file abstraction?

Avi Kivity

<avi@scylladb.com>

Apr 22, 2020, 5:56:47 AM4/22/20

to Krzysztof Małysa, seastar-dev@googlegroups.com, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

On 4/20/20 3:01 PM, Krzysztof Małysa wrote:

> - units.hh: basic units
> - cluster.hh: cluster_id unit and operations on it (converting cluster
> ids to offsets)
> - inode.hh: inode unit and operations on it (extracting shard_no from
> inode and allocating new inode)
> - bitwise.hh: bitwise operations
> - range.hh: range abstraction

>
> Signed-off-by: Krzysztof Małysa <var...@gmail.com>
> ---

> src/fs/bitwise.hh | 125 ++++++++++++++++++++++++++++++++++++++++++++++
> src/fs/cluster.hh | 42 ++++++++++++++++
> src/fs/inode.hh | 80 +++++++++++++++++++++++++++++
> src/fs/range.hh | 61 ++++++++++++++++++++++
> src/fs/units.hh | 40 +++++++++++++++
> CMakeLists.txt | 5 ++
> 6 files changed, 353 insertions(+)
> create mode 100644 src/fs/bitwise.hh
> create mode 100644 src/fs/cluster.hh
> create mode 100644 src/fs/inode.hh
> create mode 100644 src/fs/range.hh
> create mode 100644 src/fs/units.hh
>
> diff --git a/src/fs/bitwise.hh b/src/fs/bitwise.hh
> new file mode 100644
> index 00000000..e53c1919
> --- /dev/null
> +++ b/src/fs/bitwise.hh
> @@ -0,0 +1,125 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*
> + * Copyright (C) 2019 ScyllaDB

> + */
> +
> +#pragma once
> +

> +#include <cassert>
> +#include <type_traits>

> +
> +namespace seastar::fs {
> +

> +template<class T, std::enable_if_t<std::is_unsigned_v<T>, int> = 0>
> +constexpr inline bool is_power_of_2(T x) noexcept {
> + return (x > 0 and (x & (x - 1)) == 0);
> +}
> +
> +static_assert(not is_power_of_2(0u));
> +static_assert(is_power_of_2(1u));
> +static_assert(is_power_of_2(2u));
> +static_assert(not is_power_of_2(3u));
> +static_assert(is_power_of_2(4u));
> +static_assert(not is_power_of_2(5u));
> +static_assert(not is_power_of_2(6u));
> +static_assert(not is_power_of_2(7u));

Please avoid the cute "not" keyword. While it's standard and all that,
it's surprising to those unfamiliar with it.

You can move all those static asserts to a .cc file so they aren't
evaluated every time this is included.

> +static_assert(is_power_of_2(8u));
> +
> +template<class T, class U, std::enable_if_t<std::is_unsigned_v<T>, int> = 0, std::enable_if_t<std::is_unsigned_v<U>, int> = 0>
> +constexpr inline T div_by_power_of_2(T a, U b) noexcept {
> + assert(is_power_of_2(b));
> + return (a >> __builtin_ctzll(b)); // should be 2 CPU cycles after inlining on modern x86_64
> +}

The old-time approach is to define these operations in terms of their
log2 values so these tricks aren't needed. But this is okay too, even
though it requires non-standard __builtin things.
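
A sketch of the log2-based alternative, assuming sizes are carried around as their base-2 logarithm. The function names are hypothetical and not part of the patch; the caller is assumed to pass `b_log2 < 64`.

```cpp
#include <cassert>
#include <cstdint>

// Mask with the low b_log2 bits set. Precondition: b_log2 < 64.
constexpr std::uint64_t low_mask(unsigned b_log2) noexcept {
    return (std::uint64_t(1) << b_log2) - 1;
}

// Because the size is 1 << b_log2 by construction, no power-of-2 check and no
// count-trailing-zeros builtin is needed.
constexpr std::uint64_t div_by_pow2(std::uint64_t a, unsigned b_log2) noexcept {
    return a >> b_log2;
}

constexpr std::uint64_t mul_by_pow2(std::uint64_t a, unsigned b_log2) noexcept {
    return a << b_log2;
}

constexpr std::uint64_t mod_by_pow2(std::uint64_t a, unsigned b_log2) noexcept {
    return a & low_mask(b_log2);
}

constexpr std::uint64_t round_down_to_pow2(std::uint64_t a, unsigned b_log2) noexcept {
    return a & ~low_mask(b_log2);
}

// Note: a + mask can wrap around for values close to the 64-bit limit.
constexpr std::uint64_t round_up_to_pow2(std::uint64_t a, unsigned b_log2) noexcept {
    return (a + low_mask(b_log2)) & ~low_mask(b_log2);
}

// The same cases as in the patch, with 4 -> log2 2, 16 -> log2 4, 32 -> log2 5.
static_assert(div_by_pow2(12, 2) == 3, "12 / 4");
static_assert(mod_by_pow2(42, 5) == 10, "42 % 32");
static_assert(mul_by_pow2(3, 2) == 12, "3 * 4");
static_assert(round_down_to_pow2(47, 4) == 32, "47 rounded down to 16");
static_assert(round_up_to_pow2(33, 4) == 48, "33 rounded up to 16");
```

The trade-off is that every caller has to store and pass the logarithm instead of the size itself.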

> +
> +static_assert(div_by_power_of_2(13u, 1u) == 13);
> +static_assert(div_by_power_of_2(12u, 4u) == 3);
> +static_assert(div_by_power_of_2(42u, 32u) == 1);
> +
> +template<class T, class U, std::enable_if_t<std::is_unsigned_v<T>, int> = 0, std::enable_if_t<std::is_unsigned_v<U>, int> = 0>
> +constexpr inline T mod_by_power_of_2(T a, U b) noexcept {
> + assert(is_power_of_2(b));
> + return (a & (b - 1));
> +}
> +
> +static_assert(mod_by_power_of_2(13u, 1u) == 0);
> +static_assert(mod_by_power_of_2(42u, 32u) == 10);
> +
> +template<class T, class U, std::enable_if_t<std::is_unsigned_v<T>, int> = 0, std::enable_if_t<std::is_unsigned_v<U>, int> = 0>
> +constexpr inline T mul_by_power_of_2(T a, U b) noexcept {
> + assert(is_power_of_2(b));
> + return (a << __builtin_ctzll(b)); // should be 2 CPU cycles after inlining on modern x86_64
> +}
> +
> +static_assert(mul_by_power_of_2(3u, 1u) == 3);
> +static_assert(mul_by_power_of_2(3u, 4u) == 12);
> +
> +template<class T, class U, std::enable_if_t<std::is_unsigned_v<T>, int> = 0, std::enable_if_t<std::is_unsigned_v<U>, int> = 0>
> +constexpr inline T round_down_to_multiple_of_power_of_2(T a, U b) noexcept {
> + return a - mod_by_power_of_2(a, b);
> +}
> +
> +static_assert(round_down_to_multiple_of_power_of_2(0u, 1u) == 0);
> +static_assert(round_down_to_multiple_of_power_of_2(1u, 1u) == 1);
> +static_assert(round_down_to_multiple_of_power_of_2(19u, 1u) == 19);
> +
> +static_assert(round_down_to_multiple_of_power_of_2(0u, 2u) == 0);
> +static_assert(round_down_to_multiple_of_power_of_2(1u, 2u) == 0);
> +static_assert(round_down_to_multiple_of_power_of_2(2u, 2u) == 2);
> +static_assert(round_down_to_multiple_of_power_of_2(3u, 2u) == 2);
> +static_assert(round_down_to_multiple_of_power_of_2(4u, 2u) == 4);
> +static_assert(round_down_to_multiple_of_power_of_2(5u, 2u) == 4);
> +
> +static_assert(round_down_to_multiple_of_power_of_2(31u, 16u) == 16);
> +static_assert(round_down_to_multiple_of_power_of_2(32u, 16u) == 32);
> +static_assert(round_down_to_multiple_of_power_of_2(33u, 16u) == 32);
> +static_assert(round_down_to_multiple_of_power_of_2(37u, 16u) == 32);
> +static_assert(round_down_to_multiple_of_power_of_2(39u, 16u) == 32);
> +static_assert(round_down_to_multiple_of_power_of_2(45u, 16u) == 32);
> +static_assert(round_down_to_multiple_of_power_of_2(47u, 16u) == 32);
> +static_assert(round_down_to_multiple_of_power_of_2(48u, 16u) == 48);
> +static_assert(round_down_to_multiple_of_power_of_2(49u, 16u) == 48);
> +
> +template<class T, class U, std::enable_if_t<std::is_unsigned_v<T>, int> = 0, std::enable_if_t<std::is_unsigned_v<U>, int> = 0>
> +constexpr inline T round_up_to_multiple_of_power_of_2(T a, U b) noexcept {
> + auto mod = mod_by_power_of_2(a, b);
> + return (mod == 0 ? a : a - mod + b);
> +}
> +
> +static_assert(round_up_to_multiple_of_power_of_2(0u, 1u) == 0);
> +static_assert(round_up_to_multiple_of_power_of_2(1u, 1u) == 1);
> +static_assert(round_up_to_multiple_of_power_of_2(19u, 1u) == 19);
> +
> +static_assert(round_up_to_multiple_of_power_of_2(0u, 2u) == 0);
> +static_assert(round_up_to_multiple_of_power_of_2(1u, 2u) == 2);
> +static_assert(round_up_to_multiple_of_power_of_2(2u, 2u) == 2);
> +static_assert(round_up_to_multiple_of_power_of_2(3u, 2u) == 4);
> +static_assert(round_up_to_multiple_of_power_of_2(4u, 2u) == 4);
> +static_assert(round_up_to_multiple_of_power_of_2(5u, 2u) == 6);
> +
> +static_assert(round_up_to_multiple_of_power_of_2(31u, 16u) == 32);
> +static_assert(round_up_to_multiple_of_power_of_2(32u, 16u) == 32);
> +static_assert(round_up_to_multiple_of_power_of_2(33u, 16u) == 48);
> +static_assert(round_up_to_multiple_of_power_of_2(37u, 16u) == 48);
> +static_assert(round_up_to_multiple_of_power_of_2(39u, 16u) == 48);
> +static_assert(round_up_to_multiple_of_power_of_2(45u, 16u) == 48);
> +static_assert(round_up_to_multiple_of_power_of_2(47u, 16u) == 48);
> +static_assert(round_up_to_multiple_of_power_of_2(48u, 16u) == 48);
> +static_assert(round_up_to_multiple_of_power_of_2(49u, 16u) == 64);

> +
> +} // namespace seastar::fs

> diff --git a/src/fs/cluster.hh b/src/fs/cluster.hh
> new file mode 100644
> index 00000000..a35ce323
> --- /dev/null
> +++ b/src/fs/cluster.hh
> @@ -0,0 +1,42 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*
> + * Copyright (C) 2019 ScyllaDB

> + */
> +
> +#pragma once
> +

> +#include "fs/bitwise.hh"
> +#include "fs/units.hh"

> +
> +namespace seastar::fs {
> +

> +using cluster_id_t = uint64_t;
> +using cluster_range = range<cluster_id_t>;
> +
> +inline cluster_id_t offset_to_cluster_id(disk_offset_t offset, unit_size_t cluster_size) noexcept {
> + assert(is_power_of_2(cluster_size));
> + return div_by_power_of_2(offset, cluster_size);
> +}
> +
> +inline disk_offset_t cluster_id_to_offset(cluster_id_t cluster_id, unit_size_t cluster_size) noexcept {
> + assert(is_power_of_2(cluster_size));
> + return mul_by_power_of_2(cluster_id, cluster_size);

> +}
> +
> +} // namespace seastar::fs

> diff --git a/src/fs/inode.hh b/src/fs/inode.hh
> new file mode 100644
> index 00000000..aabc8d00
> --- /dev/null
> +++ b/src/fs/inode.hh
> @@ -0,0 +1,80 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*
> + * Copyright (C) 2019 ScyllaDB

> + */
> +
> +#pragma once
> +

> +#include "fs/bitwise.hh"
> +#include "fs/units.hh"
> +
> +#include <cstdint>
> +#include <optional>

> +
> +namespace seastar::fs {
> +

> +// Last log2(fs_shards_pool_size) bits of the inode number contain the id of shard that owns the inode
> +using inode_t = uint64_t;
> +
> +// Obtains shard id of the shard owning @p inode.
> +//@p fs_shards_pool_size is the number of file system shards rounded up to a power of 2
> +inline fs_shard_id_t inode_to_shard_no(inode_t inode, fs_shard_id_t fs_shards_pool_size) noexcept {
> + assert(is_power_of_2(fs_shards_pool_size));
> + return mod_by_power_of_2(inode, fs_shards_pool_size);
> +}
> +
> +// Returns inode belonging to the shard owning @p shard_previous_inode that is next after @p shard_previous_inode
> +// (i.e. the lowest inode greater than @p shard_previous_inode belonging to the same shard)
> +//@p fs_shards_pool_size is the number of file system shards rounded up to a power of 2
> +inline inode_t shard_next_inode(inode_t shard_previous_inode, fs_shard_id_t fs_shards_pool_size) noexcept {
> + return shard_previous_inode + fs_shards_pool_size;
> +}
> +
> +// Returns first inode (lowest by value) belonging to the shard @p fs_shard_id
> +inline inode_t shard_first_inode(fs_shard_id_t fs_shard_id) noexcept {
> + return fs_shard_id;
> +}
> +
> +class shard_inode_allocator {
> + fs_shard_id_t _fs_shards_pool_size;
> + fs_shard_id_t _fs_shard_id;
> + std::optional<inode_t> _latest_allocated_inode;
> +
> +public:
> + shard_inode_allocator(fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id, std::optional<inode_t> latest_allocated_inode = std::nullopt)
> + : _fs_shards_pool_size(fs_shards_pool_size)
> + , _fs_shard_id(fs_shard_id)
> + , _latest_allocated_inode(latest_allocated_inode) {}
> +
> + inode_t alloc() noexcept {
> + if (not _latest_allocated_inode) {
> + _latest_allocated_inode = shard_first_inode(_fs_shard_id);
> + return *_latest_allocated_inode;
> + }
> +
> + _latest_allocated_inode = shard_next_inode(*_latest_allocated_inode, _fs_shards_pool_size);
> + return *_latest_allocated_inode;
> + }
> +
> + void reset(std::optional<inode_t> latest_allocated_inode = std::nullopt) noexcept {
> + _latest_allocated_inode = latest_allocated_inode;

> + }
> +};
> +
> +} // namespace seastar::fs

> diff --git a/src/fs/range.hh b/src/fs/range.hh
> new file mode 100644
> index 00000000..ef0c6756
> --- /dev/null
> +++ b/src/fs/range.hh
> @@ -0,0 +1,61 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*
> + * Copyright (C) 2019 ScyllaDB

> + */
> +
> +#pragma once
> +

> +#include <algorithm>

> +
> +namespace seastar::fs {
> +

> +template <class T>
> +struct range {
> + T beg;
> + T end; // exclusive
> +
> + constexpr bool is_empty() const noexcept { return beg >= end; }
> +
> + constexpr T size() const noexcept { return end - beg; }
> +};

We try to use all-or-nothing with structs/classes. Either every data
member is public and there are no member functions, or every data member
is private, we declare as "class", and everything is done via member
functions.
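
To illustrate the convention, here is a minimal sketch of the two acceptable shapes for a range type. The names plain_range, range_is_empty, range_size and checked_range are hypothetical and exist only for this example; the patch is free to pick either shape under its own names.

```cpp
#include <cassert>
#include <cstdint>

// Option 1: a plain aggregate. Every data member is public and there are
// no member functions; helpers are free functions.
template <class T>
struct plain_range {
    T beg;
    T end; // exclusive
};

template <class T>
constexpr bool range_is_empty(plain_range<T> r) noexcept { return r.beg >= r.end; }

template <class T>
constexpr T range_size(plain_range<T> r) noexcept { return r.end - r.beg; }

// Option 2: a class. Every data member is private and all access goes
// through member functions.
template <class T>
class checked_range {
    T _beg;
    T _end; // exclusive
public:
    constexpr checked_range(T beg, T end) noexcept : _beg(beg), _end(end) {}
    constexpr T beg() const noexcept { return _beg; }
    constexpr T end() const noexcept { return _end; }
    constexpr bool is_empty() const noexcept { return _beg >= _end; }
    constexpr T size() const noexcept { return _end - _beg; }
};
```

Option 1 keeps aggregate initialization ({beg, end}) working; option 2 leaves room to enforce invariants later.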

> +
> +template <class T>
> +range(T beg, T end) -> range<T>;
> +
> +template <class T>
> +inline bool operator==(range<T> a, range<T> b) noexcept {
> + return (a.beg == b.beg and a.end == b.end);
> +}

Please use && etc. rather than the alternative tokens (and, not).

> +
> +template <class T>
> +inline bool operator!=(range<T> a, range<T> b) noexcept {
> + return not (a == b);
> +}
> +
> +template <class T>
> +inline range<T> intersection(range<T> a, range<T> b) noexcept {
> + return {std::max(a.beg, b.beg), std::min(a.end, b.end)};
> +}
> +
> +template <class T>
> +inline bool are_intersecting(range<T> a, range<T> b) noexcept {
> + return (std::max(a.beg, b.beg) < std::min(a.end, b.end));

> +}
> +
> +} // namespace seastar::fs

> diff --git a/src/fs/units.hh b/src/fs/units.hh
> new file mode 100644
> index 00000000..1fc6754b
> --- /dev/null
> +++ b/src/fs/units.hh
> @@ -0,0 +1,40 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*
> + * Copyright (C) 2019 ScyllaDB

> + */
> +
> +#pragma once
> +

> +#include "fs/range.hh"
> +
> +#include <cstdint>

> +
> +namespace seastar::fs {
> +

> +using unit_size_t = uint32_t;
> +

What is a unit? A sector? If so, then 32 bits aren't enough (they address
just 16TB).
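
The arithmetic behind the 16TB figure can be checked at compile time. This sketch assumes a unit is a 4096-byte sector (the same value as min_alignment later in the series); the constant names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: one "unit" is a 4096-byte sector.
constexpr std::uint64_t sector_size = 4096;

// A 32-bit index can name 2^32 distinct sectors.
constexpr std::uint64_t sectors_addressable_32 = std::uint64_t(1) << 32;

// 2^32 sectors * 2^12 bytes per sector = 2^44 bytes.
constexpr std::uint64_t bytes_addressable_32 = sectors_addressable_32 * sector_size;

constexpr std::uint64_t TiB = std::uint64_t(1) << 40;

static_assert(bytes_addressable_32 == 16 * TiB,
              "a 32-bit index of 4 KiB sectors covers exactly 16 TiB");
```

With smaller sectors the limit is lower still, so a 64-bit type is the safer choice if unit_size_t ever counts sectors.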

> +using disk_offset_t = uint64_t;
> +using disk_range = range<disk_offset_t>;
> +
> +using file_offset_t = uint64_t;
> +using file_range = range<file_offset_t>;
> +
> +using fs_shard_id_t = uint32_t;

> +
> +} // namespace seastar::fs

> diff --git a/CMakeLists.txt b/CMakeLists.txt
> index 39d11ad8..8ad08c7a 100644
> --- a/CMakeLists.txt
> +++ b/CMakeLists.txt
> @@ -660,7 +660,12 @@ if (Seastar_EXPERIMENTAL_FS)
> include/seastar/fs/block_device.hh
> include/seastar/fs/file.hh
> include/seastar/fs/temporary_file.hh
> + src/fs/bitwise.hh
> + src/fs/cluster.hh
> src/fs/file.cc
> + src/fs/inode.hh
> + src/fs/range.hh
> + src/fs/units.hh
> )
> endif()
>

Avi Kivity

<avi@scylladb.com>

Apr 22, 2020, 6:06:50 AM

to Krzysztof Małysa, seastar-dev@googlegroups.com, Wojciech Mitros, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com

On 4/20/20 3:01 PM, Krzysztof Małysa wrote:

> From: Wojciech Mitros <wmi...@protonmail.com>
>
> Disk space is divided into segments of set size, called clusters. Each shard of
> the filesystem will be assigned a set of clusters. Cluster allocator is the tool
> that enables allocating and freeing clusters from that set.

>
> Signed-off-by: Wojciech Mitros <wmi...@protonmail.com>
> ---

> src/fs/cluster_allocator.hh | 50 ++++++++++++++++++++++++++++++++++
> src/fs/cluster_allocator.cc | 54 +++++++++++++++++++++++++++++++++++++
> CMakeLists.txt | 2 ++
> 3 files changed, 106 insertions(+)
> create mode 100644 src/fs/cluster_allocator.hh
> create mode 100644 src/fs/cluster_allocator.cc
>
> diff --git a/src/fs/cluster_allocator.hh b/src/fs/cluster_allocator.hh
> new file mode 100644
> index 00000000..ef4f30b9
> --- /dev/null
> +++ b/src/fs/cluster_allocator.hh
> @@ -0,0 +1,50 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*
> + * Copyright (C) 2019 ScyllaDB
> + */
> +
> +#pragma once
> +

> +#include "fs/cluster.hh"
> +
> +#include <deque>
> +#include <optional>
> +#include <unordered_set>
> +
> +namespace seastar {
> +
> +namespace fs {
> +
> +class cluster_allocator {
> + std::unordered_set<cluster_id_t> _allocated_clusters;
> + std::deque<cluster_id_t> _free_clusters;

This looks inefficient in terms of memory storage. I expect we'll have
around 32 bytes/cluster. If a cluster is 1 MiB that's a ratio of 32k:1,
which isn't too bad, but this should be optimized later.

There is also a problem with large allocations happening dynamically. We
can contribute Scylla's dynamic_bitset which eats just 1 bit/cluster.
All this can happen later, it's fine to start with unordered_set/deque.
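
As a rough idea of what a 1 bit/cluster allocator could look like, here is a minimal sketch. It is not Scylla's dynamic_bitset; std::vector<bool> stands in for it, the class name bitmap_cluster_allocator is hypothetical, and cluster_id_t is assumed to be a 64-bit unsigned integer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

using cluster_id_t = std::uint64_t; // assumption: stands in for the alias from fs/cluster.hh

// Hypothetical bitmap-backed allocator: one bit per cluster, with all
// memory allocated once in the constructor.
class bitmap_cluster_allocator {
    cluster_id_t _first;          // id of the first managed cluster
    std::vector<bool> _allocated; // _allocated[i] is true iff cluster _first + i is in use
    std::size_t _search_hint = 0; // invariant: every index below the hint is allocated
public:
    bitmap_cluster_allocator(cluster_id_t first, std::size_t count)
        : _first(first), _allocated(count, false) {}

    // Returns the lowest free cluster id, or nullopt when all are in use.
    std::optional<cluster_id_t> alloc() noexcept {
        for (std::size_t i = _search_hint; i < _allocated.size(); ++i) {
            if (!_allocated[i]) {
                _allocated[i] = true;
                _search_hint = i + 1;
                return _first + i;
            }
        }
        _search_hint = _allocated.size();
        return std::nullopt;
    }

    // Only flips a bit, so it allocates nothing and cannot throw.
    void free(cluster_id_t id) noexcept {
        assert(id >= _first && id - _first < _allocated.size());
        const std::size_t i = id - _first;
        assert(_allocated[i]);
        _allocated[i] = false;
        if (i < _search_hint) {
            _search_hint = i;
        }
    }
};
```

The linear scan is the price of the compact representation; a real bitset would search a word at a time.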

> +
> +public:
> + explicit cluster_allocator(std::unordered_set<cluster_id_t> allocated_clusters, std::deque<cluster_id_t> free_clusters);
> +
> + // Tries to allocate a cluster
> + std::optional<cluster_id_t> alloc();
> +
> + // @p cluster_id has to be allocated using alloc()
> + void free(cluster_id_t cluster_id);
> +};
> +
> +}
> +
> +}
> diff --git a/src/fs/cluster_allocator.cc b/src/fs/cluster_allocator.cc
> new file mode 100644
> index 00000000..c436c7ba
> --- /dev/null
> +++ b/src/fs/cluster_allocator.cc
> @@ -0,0 +1,54 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*
> + * Copyright (C) 2019 ScyllaDB
> + */
> +

> +#include "fs/cluster.hh"
> +#include "fs/cluster_allocator.hh"
> +
> +#include <cassert>
> +#include <optional>
> +
> +namespace seastar {
> +
> +namespace fs {
> +
> +cluster_allocator::cluster_allocator(std::unordered_set<cluster_id_t> allocated_clusters, std::deque<cluster_id_t> free_clusters)
> + : _allocated_clusters(std::move(allocated_clusters)), _free_clusters(std::move(free_clusters)) {}
> +
> +std::optional<cluster_id_t> cluster_allocator::alloc() {
> + if (_free_clusters.empty()) {
> + return std::nullopt;
> + }
> +
> + cluster_id_t cluster_id = _free_clusters.front();
> + _free_clusters.pop_front();
> + _allocated_clusters.insert(cluster_id);
> + return cluster_id;
> +}
> +
> +void cluster_allocator::free(cluster_id_t cluster_id) {
> + assert(_allocated_clusters.count(cluster_id) == 1);
> + _free_clusters.emplace_back(cluster_id);

What if this fails? Maybe we should reserve _free_clusters up front (and
use a vector, not a deque, since there's now no point to deque's
fragmented allocation). Usually free paths should be fail-safe, since
they can occur when rolling back some operation. Example: c1 =
a.alloc(); auto undo = defer([&] { a.free(*c1); }); ... fail here ... If
we fail, we end up throwing in the destructor.
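
A minimal sketch of the reserve-up-front idea, under stated assumptions: the class name reserving_cluster_allocator and the total_clusters parameter are hypothetical, cluster_id_t is assumed to be a 64-bit unsigned integer, and the _allocated_clusters bookkeeping from the patch is left out to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

using cluster_id_t = std::uint64_t; // assumption: stands in for the alias from fs/cluster.hh

class reserving_cluster_allocator {
    std::vector<cluster_id_t> _free_clusters; // used as a stack
public:
    // total_clusters is the number of clusters this allocator can ever
    // hold (free + allocated). Capacity for all of them is reserved here,
    // where an allocation failure is still easy to report.
    reserving_cluster_allocator(std::vector<cluster_id_t> free_clusters, std::size_t total_clusters)
        : _free_clusters(std::move(free_clusters)) {
        assert(_free_clusters.size() <= total_clusters);
        _free_clusters.reserve(total_clusters);
    }

    std::optional<cluster_id_t> alloc() noexcept {
        if (_free_clusters.empty()) {
            return std::nullopt;
        }
        const cluster_id_t id = _free_clusters.back();
        _free_clusters.pop_back();
        return id;
    }

    // push_back() never reallocates while size() < capacity(), so this
    // path cannot throw and is safe to call from a destructor or undo hook.
    void free(cluster_id_t id) noexcept {
        assert(_free_clusters.size() < _free_clusters.capacity());
        _free_clusters.push_back(id);
    }
};
```

Note that this hands clusters out in LIFO order, whereas the deque in the patch is FIFO; whether that matters depends on how reuse order interacts with the rest of the filesystem.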

btw, in the future, we should launder freed clusters through
discard/trim, but of course this is out of scope now.

> + _allocated_clusters.erase(cluster_id);

> +}
> +
> +}
> +
> +}

> diff --git a/CMakeLists.txt b/CMakeLists.txt
> index 8ad08c7a..891201a3 100644
> --- a/CMakeLists.txt
> +++ b/CMakeLists.txt
> @@ -662,6 +662,8 @@ if (Seastar_EXPERIMENTAL_FS)
> include/seastar/fs/temporary_file.hh
> src/fs/bitwise.hh
> src/fs/cluster.hh
> + src/fs/cluster_allocator.cc
> + src/fs/cluster_allocator.hh
> src/fs/file.cc
> src/fs/inode.hh
> src/fs/range.hh

Avi Kivity

<avi@scylladb.com>

Apr 22, 2020, 6:17:24 AM

to Krzysztof Małysa, seastar-dev@googlegroups.com, Michał Niciejewski, sarna@scylladb.com, ankezy@gmail.com, wmitros@protonmail.com

On 4/20/20 3:01 PM, Krzysztof Małysa wrote:

> From: Michał Niciejewski <qup...@gmail.com>
>
> Bootstrap record serves the same role as the superblock in other
> filesystems.
> It contains basic information essential to properly bootstrap the
> filesystem:
> - filesystem version
> - alignment used for data writes
> - cluster size
> - inode number of the root directory
> - information needed to bootstrap every shard of the filesystem:
> * id of the first metadata log cluster
> * range of available clusters for data and metadata

>
> Signed-off-by: Michał Niciejewski <qup...@gmail.com>
> ---

> src/fs/bootstrap_record.hh | 98 ++++++++++++++++++
> src/fs/crc.hh | 34 ++++++
> src/fs/bootstrap_record.cc | 206 +++++++++++++++++++++++++++++++++++++
> CMakeLists.txt | 3 +
> 4 files changed, 341 insertions(+)
> create mode 100644 src/fs/bootstrap_record.hh
> create mode 100644 src/fs/crc.hh
> create mode 100644 src/fs/bootstrap_record.cc
>
> diff --git a/src/fs/bootstrap_record.hh b/src/fs/bootstrap_record.hh
> new file mode 100644
> index 00000000..ee15295a
> --- /dev/null
> +++ b/src/fs/bootstrap_record.hh
> @@ -0,0 +1,98 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*
> + * Copyright (C) 2019 ScyllaDB
> + */
> +
> +#pragma once
> +
> +#include "fs/cluster.hh"

> +#include "fs/inode.hh"
> +#include "seastar/fs/block_device.hh"
> +
> +#include <exception>

> +
> +namespace seastar::fs {
> +

> +class invalid_bootstrap_record : public std::runtime_error {
> +public:
> + explicit invalid_bootstrap_record(const std::string& msg) : std::runtime_error(msg) {}
> + explicit invalid_bootstrap_record(const char* msg) : std::runtime_error(msg) {}
> +};
> +
> +/// In-memory version of the record describing characteristics of the file system (~superblock).
> +class bootstrap_record {
> +public:
> + static constexpr uint64_t magic_number = 0x5343594c4c414653; // SCYLLAFS

Please use seastar; this isn't Scylla-specific.

> + static constexpr uint32_t max_shards_nb = 500;
> + static constexpr unit_size_t min_alignment = 4096;
> +
> + struct shard_info {
> + cluster_id_t metadata_cluster; /// cluster id of the first metadata log cluster
> + cluster_range available_clusters; /// range of clusters for data for this shard
> + };
> +
> + uint64_t version; /// file system version
> + unit_size_t alignment; /// write alignment in bytes
> + unit_size_t cluster_size; /// cluster size in bytes
> + inode_t root_directory; /// root dir inode number
> + std::vector<shard_info> shards_info; /// basic informations about each file system shard
> +

Please follow the all-or-nothing public member principle.

> + bootstrap_record() = default;
> + bootstrap_record(uint64_t version, unit_size_t alignment, unit_size_t cluster_size, inode_t root_directory,
> + std::vector<shard_info> shards_info)
> + : version(version), alignment(alignment), cluster_size(cluster_size) , root_directory(root_directory)
> + , shards_info(std::move(shards_info)) {}
> +
> + /// number of file system shards
> + uint32_t shards_nb() const noexcept {
> + return shards_info.size();
> + }
> +
> + static future<bootstrap_record> read_from_disk(block_device& device);
> + future<> write_to_disk(block_device& device) const;
> +
> + friend bool operator==(const bootstrap_record&, const bootstrap_record&) noexcept;
> + friend bool operator!=(const bootstrap_record&, const bootstrap_record&) noexcept;
> +};
> +
> +inline bool operator==(const bootstrap_record::shard_info& lhs, const bootstrap_record::shard_info& rhs) noexcept {
> + return lhs.metadata_cluster == rhs.metadata_cluster and lhs.available_clusters == rhs.available_clusters;
> +}
> +
> +inline bool operator!=(const bootstrap_record::shard_info& lhs, const bootstrap_record::shard_info& rhs) noexcept {
> + return !(lhs == rhs);
> +}
> +
> +inline bool operator!=(const bootstrap_record& lhs, const bootstrap_record& rhs) noexcept {
> + return !(lhs == rhs);
> +}
> +
> +/// On-disk version of the record describing characteristics of the file system (~superblock).
> +struct bootstrap_record_disk {
> + uint64_t magic;
> + uint64_t version;
> + unit_size_t alignment;
> + unit_size_t cluster_size;

Better to use explicitly sized types for on-disk data structures, so the
layout is absolutely clear and doesn't change if we decide to change a
typedef such as unit_size_t.

> + inode_t root_directory;
> + uint32_t shards_nb;
> + bootstrap_record::shard_info shards_info[bootstrap_record::max_shards_nb];
> + uint32_t crc;
> +};

If this goes directly to disk, it should be [[gnu::packed]] to avoid
different alignment constraints from generating different layouts. It
should also have a defined endianness.
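
A small sketch of what these points could look like together. The struct disk_header and the helpers store_le32/load_le32 are hypothetical and only illustrate the idea; they are not the layout proposed for bootstrap_record_disk, and little-endian is an assumed choice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical on-disk header: fixed-width fields only, no padding.
struct [[gnu::packed]] disk_header {
    std::uint64_t magic;      // offset 0
    std::uint32_t alignment;  // offset 8
    std::uint64_t root_inode; // offset 12 (would be 16 without packing)
    std::uint32_t crc;        // offset 20
};

// Pin the layout so an accidental change fails to compile.
static_assert(sizeof(disk_header) == 24, "on-disk size must not change silently");
static_assert(offsetof(disk_header, root_inode) == 12, "no padding before root_inode");

// Stores v at dst in little-endian byte order, whatever the host order is.
inline void store_le32(unsigned char* dst, std::uint32_t v) noexcept {
    for (int i = 0; i < 4; ++i) {
        dst[i] = static_cast<unsigned char>(v >> (8 * i));
    }
}

// Reads a little-endian 32-bit value from src, whatever the host order is.
inline std::uint32_t load_le32(const unsigned char* src) noexcept {
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i) {
        v |= static_cast<std::uint32_t>(src[i]) << (8 * i);
    }
    return v;
}
```

The static_asserts double as documentation of the format: anyone who reorders or resizes a field has to update them deliberately.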

Let's add feature bitmaps for compatibility. It is customary to use a
compatible bitmap and an incompatible bitmap - you can mount a
filesystem with some unknown compatible bits set (but have to clear them
if you write to the filesystem), but you can't mount a filesystem with
incompatible bits. This is more flexible than version numbers.
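
A minimal sketch of the mount-time check such bitmaps imply. All names and bit values here (known_compat_features, known_incompat_features, mount_decision, check_features) are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sets of feature bits that this implementation understands.
constexpr std::uint64_t known_compat_features = 0x1;
constexpr std::uint64_t known_incompat_features = 0x3;

struct mount_decision {
    bool can_mount;
    // Unknown compatible bits; a writer has to clear these on disk
    // before modifying the filesystem.
    std::uint64_t compat_bits_to_clear_on_write;
};

inline mount_decision check_features(std::uint64_t compat, std::uint64_t incompat) noexcept {
    // Any unknown incompatible bit means we cannot interpret the data safely.
    if ((incompat & ~known_incompat_features) != 0) {
        return {false, 0};
    }
    // Unknown compatible bits do not prevent mounting.
    return {true, compat & ~known_compat_features};
}
```

A new feature then only needs a new bit in the appropriate bitmap rather than a bump of a global version number.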

btw, you can resolve these with "FIXMEs", they are critical for
production but they shouldn't block development.

> +
> +}
> diff --git a/src/fs/crc.hh b/src/fs/crc.hh
> new file mode 100644
> index 00000000..da557323
> --- /dev/null
> +++ b/src/fs/crc.hh
> @@ -0,0 +1,34 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*
> + * Copyright (C) 2019 ScyllaDB
> + */
> +
> +#pragma once
> +

> +#include <boost/crc.hpp>

> +
> +namespace seastar::fs {
> +

> +inline uint32_t crc32(const void* buff, size_t len) noexcept {
> + boost::crc_32_type result;
> + result.process_bytes(buff, len);
> + return result.checksum();
> +}
> +
> +}
> diff --git a/src/fs/bootstrap_record.cc b/src/fs/bootstrap_record.cc
> new file mode 100644
> index 00000000..a342efb6
> --- /dev/null
> +++ b/src/fs/bootstrap_record.cc
> @@ -0,0 +1,206 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*
> + * Copyright (C) 2019 ScyllaDB
> + */
> +

> +#include "fs/bootstrap_record.hh"
> +#include "fs/crc.hh"
> +#include "seastar/core/print.hh"
> +#include "seastar/core/units.hh"

> +
> +namespace seastar::fs {
> +

> +namespace {
> +
> +constexpr unit_size_t write_alignment = 4 * KB;
> +constexpr disk_offset_t bootstrap_record_offset = 0;
> +
> +constexpr size_t aligned_bootstrap_record_size =
> + (1 + (sizeof(bootstrap_record_disk) - 1) / write_alignment) * write_alignment;

This number should be defined, not calculated. We don't want it changing
as we add more stuff.
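
One way to do that, sketched with hypothetical names: record_on_disk is a shortened stand-in for bootstrap_record_disk, and the 16 KiB value is an arbitrary assumption for illustration, not a proposal.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for bootstrap_record_disk; field list shortened.
struct record_on_disk {
    std::uint64_t magic;
    std::uint64_t version;
    std::uint32_t crc;
};

constexpr std::size_t write_alignment = 4096;

// Defined, not calculated: this value is part of the on-disk format.
// (16 KiB is an assumption made for this example.)
constexpr std::size_t bootstrap_record_size = 16 * 1024;

static_assert(bootstrap_record_size % write_alignment == 0,
              "record size must be a multiple of the write alignment");
static_assert(sizeof(record_on_disk) <= bootstrap_record_size,
              "record grew past its reserved on-disk size");
```

Adding fields then either fits in the reserved space or breaks the build, instead of silently changing how many bytes are read and written.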

> +constexpr size_t crc_offset = offsetof(bootstrap_record_disk, crc);
> +
> +inline std::optional<invalid_bootstrap_record> check_alignment(unit_size_t alignment) {
> + if (!is_power_of_2(alignment)) {
> + return invalid_bootstrap_record(fmt::format("Alignment should be a power of 2, read alignment '{}'",
> + alignment));
> + }
> + if (alignment < bootstrap_record::min_alignment) {
> + return invalid_bootstrap_record(fmt::format("Alignment should be greater or equal to {}, read alignment '{}'",
> + bootstrap_record::min_alignment, alignment));
> + }

> + return std::nullopt;
> +}
> +

> +inline std::optional<invalid_bootstrap_record> check_cluster_size(unit_size_t cluster_size, unit_size_t alignment) {
> + if (!is_power_of_2(cluster_size)) {
> + return invalid_bootstrap_record(fmt::format("Cluster size should be a power of 2, read cluster size '{}'", cluster_size));
> + }
> + if (cluster_size % alignment != 0) {
> + return invalid_bootstrap_record(fmt::format(
> + "Cluster size should be divisible by alignment, read alignment '{}', read cluster size '{}'",
> + alignment, cluster_size));
> + }

> + return std::nullopt;
> +}
> +

> +inline std::optional<invalid_bootstrap_record> check_shards_number(uint32_t shards_nb) {
> + if (shards_nb == 0) {
> + return invalid_bootstrap_record(fmt::format("Shards number should be greater than 0, read shards number '{}'",
> + shards_nb));
> + }
> + if (shards_nb > bootstrap_record::max_shards_nb) {
> + return invalid_bootstrap_record(fmt::format(
> + "Shards number should be smaller or equal to {}, read shards number '{}'",
> + bootstrap_record::max_shards_nb, shards_nb));
> + }

> + return std::nullopt;
> +}
> +

> +std::optional<invalid_bootstrap_record> check_shards_info(std::vector<bootstrap_record::shard_info> shards_info) {
> + // check 1 <= beg <= metadata_cluster < end
> + for (const bootstrap_record::shard_info& info : shards_info) {
> + if (info.available_clusters.beg >= info.available_clusters.end) {
> + return invalid_bootstrap_record(fmt::format("Invalid cluster range, read cluster range [{}, {})",
> + info.available_clusters.beg, info.available_clusters.end));
> + }
> + if (info.available_clusters.beg == 0) {
> + return invalid_bootstrap_record(fmt::format(
> + "Range of available clusters should not contain cluster 0, read cluster range [{}, {})",
> + info.available_clusters.beg, info.available_clusters.end));
> + }
> + if (info.available_clusters.beg > info.metadata_cluster ||
> + info.available_clusters.end <= info.metadata_cluster) {
> + return invalid_bootstrap_record(fmt::format(
> + "Cluster with metadata should be inside available cluster range, read cluster range [{}, {}), read metadata cluster '{}'",
> + info.available_clusters.beg, info.available_clusters.end, info.metadata_cluster));
> + }
> + }
> +
> + // check that ranges don't overlap
> + sort(shards_info.begin(), shards_info.end(),
> + [] (const bootstrap_record::shard_info& left,
> + const bootstrap_record::shard_info& right) {
> + return left.available_clusters.beg < right.available_clusters.beg;
> + });
> + for (size_t i = 1; i < shards_info.size(); i++) {
> + if (shards_info[i - 1].available_clusters.end > shards_info[i].available_clusters.beg) {
> + return invalid_bootstrap_record(fmt::format(
> + "Cluster ranges should not overlap, overlaping ranges [{}, {}), [{}, {})",
> + shards_info[i - 1].available_clusters.beg, shards_info[i - 1].available_clusters.end,
> + shards_info[i].available_clusters.beg, shards_info[i].available_clusters.end));
> + }
> + }

> + return std::nullopt;
> +}
> +
> +}

> +
> +future<bootstrap_record> bootstrap_record::read_from_disk(block_device& device) {
> + auto bootstrap_record_buff = temporary_buffer<char>::aligned(write_alignment, aligned_bootstrap_record_size);
> + return device.read(bootstrap_record_offset, bootstrap_record_buff.get_write(), aligned_bootstrap_record_size)
> + .then([bootstrap_record_buff = std::move(bootstrap_record_buff)] (size_t ret) {
> + if (ret != aligned_bootstrap_record_size) {
> + return make_exception_future<bootstrap_record>(
> + invalid_bootstrap_record(fmt::format(
> + "Error while reading bootstrap record block, {} bytes read instead of {}",
> + ret, aligned_bootstrap_record_size)));
> + }
> +
> + bootstrap_record_disk bootstrap_record_disk;
> + std::memcpy(&bootstrap_record_disk, bootstrap_record_buff.get(), sizeof(bootstrap_record_disk));
> +
> + const uint32_t crc_calc = crc32(bootstrap_record_buff.get(), crc_offset);
> + if (crc_calc != bootstrap_record_disk.crc) {
> + return make_exception_future<bootstrap_record>(
> + invalid_bootstrap_record(fmt::format("Invalid CRC, expected crc '{}', read crc '{}'",
> + crc_calc, bootstrap_record_disk.crc)));
> + }
> + if (magic_number != bootstrap_record_disk.magic) {
> + return make_exception_future<bootstrap_record>(
> + invalid_bootstrap_record(fmt::format("Invalid magic number, expected magic '{}', read magic '{}'",
> + magic_number, bootstrap_record_disk.magic)));
> + }
> + if (std::optional<invalid_bootstrap_record> ret_check;
> + (ret_check = check_alignment(bootstrap_record_disk.alignment)) ||
> + (ret_check = check_cluster_size(bootstrap_record_disk.cluster_size, bootstrap_record_disk.alignment)) ||
> + (ret_check = check_shards_number(bootstrap_record_disk.shards_nb))) {
> + return make_exception_future<bootstrap_record>(ret_check.value());
> + }
> +
> + const std::vector<shard_info> tmp_shards_info(bootstrap_record_disk.shards_info,
> + bootstrap_record_disk.shards_info + bootstrap_record_disk.shards_nb);
> +
> + if (std::optional<invalid_bootstrap_record> ret_check;
> + (ret_check = check_shards_info(tmp_shards_info))) {
> + return make_exception_future<bootstrap_record>(ret_check.value());
> + }
> +
> + bootstrap_record bootstrap_record_mem(bootstrap_record_disk.version,
> + bootstrap_record_disk.alignment,
> + bootstrap_record_disk.cluster_size,
> + bootstrap_record_disk.root_directory,
> + std::move(tmp_shards_info));
> +
> + return make_ready_future<bootstrap_record>(std::move(bootstrap_record_mem));
> + });
> +}
> +
> +future<> bootstrap_record::write_to_disk(block_device& device) const {
> + // initial checks
> + if (std::optional<invalid_bootstrap_record> ret_check;
> + (ret_check = check_alignment(alignment)) ||
> + (ret_check = check_cluster_size(cluster_size, alignment)) ||
> + (ret_check = check_shards_number(shards_nb())) ||
> + (ret_check = check_shards_info(shards_info))) {
> + return make_exception_future<>(ret_check.value());
> + }
> +
> + auto bootstrap_record_buff = temporary_buffer<char>::aligned(write_alignment, aligned_bootstrap_record_size);
> + std::memset(bootstrap_record_buff.get_write(), 0, aligned_bootstrap_record_size);
> + bootstrap_record_disk* bootstrap_record_disk = (struct bootstrap_record_disk*)bootstrap_record_buff.get_write();
> +
> + // prepare bootstrap_record_disk records
> + bootstrap_record_disk->magic = bootstrap_record::magic_number;
> + bootstrap_record_disk->version = version;
> + bootstrap_record_disk->alignment = alignment;
> + bootstrap_record_disk->cluster_size = cluster_size;
> + bootstrap_record_disk->root_directory = root_directory;
> + bootstrap_record_disk->shards_nb = shards_nb();
> + std::copy(shards_info.begin(), shards_info.end(), bootstrap_record_disk->shards_info);
> + bootstrap_record_disk->crc = crc32(bootstrap_record_disk, crc_offset);
> +
> + return device.write(bootstrap_record_offset, bootstrap_record_buff.get(), aligned_bootstrap_record_size)
> + .then([bootstrap_record_buff = std::move(bootstrap_record_buff)] (size_t ret) {
> + if (ret != aligned_bootstrap_record_size) {
> + return make_exception_future<>(
> + invalid_bootstrap_record(fmt::format(
> + "Error while writing bootstrap record block to disk, {} bytes written instead of {}",
> + ret, aligned_bootstrap_record_size)));
> + }
> + return make_ready_future<>();
> + });
> +}
> +
> +bool operator==(const bootstrap_record& lhs, const bootstrap_record& rhs) noexcept {
> + return lhs.version == rhs.version and lhs.alignment == rhs.alignment and
> + lhs.cluster_size == rhs.cluster_size and lhs.root_directory == rhs.root_directory and
> + lhs.shards_info == rhs.shards_info;

> +}
> +
> +}
> diff --git a/CMakeLists.txt b/CMakeLists.txt

> index 891201a3..ca994d42 100644
> --- a/CMakeLists.txt
> +++ b/CMakeLists.txt
> @@ -661,9 +661,12 @@ if (Seastar_EXPERIMENTAL_FS)
> include/seastar/fs/file.hh
> include/seastar/fs/temporary_file.hh
> src/fs/bitwise.hh
> + src/fs/bootstrap_record.cc
> + src/fs/bootstrap_record.hh
> src/fs/cluster.hh
> src/fs/cluster_allocator.cc
> src/fs/cluster_allocator.hh
> + src/fs/crc.hh
> src/fs/file.cc
> src/fs/inode.hh
> src/fs/range.hh

Avi Kivity

<avi@scylladb.com>

Apr 22, 2020, 6:19:03 AM

to Krzysztof Małysa, seastar-dev@googlegroups.com, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

On 4/20/20 3:01 PM, Krzysztof Małysa wrote:

> path.hh provides extract_last_component() function that extracts the
> last component of the provided path

Is <filesystem> not sufficient for this?
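
For comparison, a minimal sketch of the same split done with std::filesystem::path. The helper name split_last_component is hypothetical.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <utility>

// Hypothetical helper: returns (parent, last component) of a path using
// std::filesystem::path instead of manual string searching.
inline std::pair<std::string, std::string> split_last_component(const std::string& path) {
    const std::filesystem::path p(path);
    return {p.parent_path().string(), p.filename().string()};
}
```

One behavioral difference to be aware of: for "/foo/bar.txt" the patch leaves "/foo/" in the input string, while parent_path() yields "/foo" without the trailing slash. As in the patch, a path ending in '/' has an empty last component.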

> Signed-off-by: Krzysztof Małysa <var...@gmail.com>
> ---

> src/fs/path.hh | 42 ++++++++++++++++++
> tests/unit/fs_path_test.cc | 90 ++++++++++++++++++++++++++++++++++++++
> CMakeLists.txt | 1 +
> tests/unit/CMakeLists.txt | 3 ++
> 4 files changed, 136 insertions(+)
> create mode 100644 src/fs/path.hh
> create mode 100644 tests/unit/fs_path_test.cc
>
> diff --git a/src/fs/path.hh b/src/fs/path.hh
> new file mode 100644
> index 00000000..9da4c517
> --- /dev/null
> +++ b/src/fs/path.hh
> @@ -0,0 +1,42 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*
> + * Copyright (C) 2019 ScyllaDB
> + */
> +
> +#pragma once
> +

> +#include <string>

> +
> +namespace seastar::fs {
> +

> +// Extracts the last component in @p path. WARNING: The last component is empty iff @p path is empty or ends with '/'
> +inline std::string extract_last_component(std::string& path) {
> + auto beg = path.find_last_of('/');
> + if (beg == path.npos) {
> + std::string res = std::move(path);
> + path = {};
> + return res;
> + }
> +
> + auto res = path.substr(beg + 1);
> + path.resize(beg + 1);
> + return res;

> +}
> +
> +} // namespace seastar::fs

> diff --git a/tests/unit/fs_path_test.cc b/tests/unit/fs_path_test.cc
> new file mode 100644
> index 00000000..956e64d7
> --- /dev/null
> +++ b/tests/unit/fs_path_test.cc
> @@ -0,0 +1,90 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*

> + * Copyright (C) 2020 ScyllaDB
> + */
> +
> +#include "fs/path.hh"
> +
> +#define BOOST_TEST_MODULE fs
> +#include <boost/test/included/unit_test.hpp>
> +

> +using namespace seastar::fs;
> +

> +BOOST_AUTO_TEST_CASE(last_component_simple) {
> + {
> + std::string str = "";
> + BOOST_REQUIRE_EQUAL(extract_last_component(str), "");
> + BOOST_REQUIRE_EQUAL(str, "");
> + }
> + {
> + std::string str = "/";
> + BOOST_REQUIRE_EQUAL(extract_last_component(str), "");
> + BOOST_REQUIRE_EQUAL(str, "/");
> + }
> + {
> + std::string str = "/foo/bar.txt";
> + BOOST_REQUIRE_EQUAL(extract_last_component(str), "bar.txt");
> + BOOST_REQUIRE_EQUAL(str, "/foo/");
> + }
> + {
> + std::string str = "/foo/.bar";
> + BOOST_REQUIRE_EQUAL(extract_last_component(str), ".bar");
> + BOOST_REQUIRE_EQUAL(str, "/foo/");
> + }
> + {
> + std::string str = "/foo/bar/";
> + BOOST_REQUIRE_EQUAL(extract_last_component(str), "");
> + BOOST_REQUIRE_EQUAL(str, "/foo/bar/");
> + }
> + {
> + std::string str = "/foo/.";
> + BOOST_REQUIRE_EQUAL(extract_last_component(str), ".");
> + BOOST_REQUIRE_EQUAL(str, "/foo/");
> + }
> + {
> + std::string str = "/foo/..";
> + BOOST_REQUIRE_EQUAL(extract_last_component(str), "..");
> + BOOST_REQUIRE_EQUAL(str, "/foo/");
> + }
> + {
> + std::string str = "bar.txt";
> + BOOST_REQUIRE_EQUAL(extract_last_component(str), "bar.txt");
> + BOOST_REQUIRE_EQUAL(str, "");
> + }
> + {
> + std::string str = ".bar";
> + BOOST_REQUIRE_EQUAL(extract_last_component(str), ".bar");
> + BOOST_REQUIRE_EQUAL(str, "");
> + }
> + {
> + std::string str = ".";
> + BOOST_REQUIRE_EQUAL(extract_last_component(str), ".");
> + BOOST_REQUIRE_EQUAL(str, "");
> + }
> + {
> + std::string str = "..";
> + BOOST_REQUIRE_EQUAL(extract_last_component(str), "..");
> + BOOST_REQUIRE_EQUAL(str, "");
> + }
> + {
> + std::string str = "//host";
> + BOOST_REQUIRE_EQUAL(extract_last_component(str), "host");
> + BOOST_REQUIRE_EQUAL(str, "//");

> + }
> +}
> diff --git a/CMakeLists.txt b/CMakeLists.txt

> index be3f921f..fb8fe32c 100644
> --- a/CMakeLists.txt
> +++ b/CMakeLists.txt
> @@ -670,6 +670,7 @@ if (Seastar_EXPERIMENTAL_FS)
> src/fs/crc.hh
> src/fs/file.cc
> src/fs/inode.hh
> + src/fs/path.hh
> src/fs/range.hh
> src/fs/units.hh
> )
> diff --git a/tests/unit/CMakeLists.txt b/tests/unit/CMakeLists.txt
> index f9591046..07551b0b 100644
> --- a/tests/unit/CMakeLists.txt
> +++ b/tests/unit/CMakeLists.txt
> @@ -372,6 +372,9 @@ if (Seastar_EXPERIMENTAL_FS)

> seastar_add_test (fs_cluster_allocator
> KIND BOOST
> SOURCES fs_cluster_allocator_test.cc)

> + seastar_add_test (fs_path
> + KIND BOOST
> + SOURCES fs_path_test.cc)
> seastar_add_test (fs_seastarfs
> SOURCES fs_seastarfs_test.cc)
> endif()

Avi Kivity

<avi@scylladb.com>

Apr 22, 2020, 6:23:11 AM

to Krzysztof Małysa, seastar-dev@googlegroups.com, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

On 4/20/20 3:01 PM, Krzysztof Małysa wrote:

> value shared lock allows locking (using shared_mutex) a specified value.
> One operation locks only one value, but value shared lock allows you to
> maintain locks on different values in one place. Also locking is
> "on demand" i.e. corresponding shared_mutex will not be created unless a
> lock will be used on value and will be deleted as soon as the value is
> not being locked by anyone. It serves as a dynamic pool of shared_mutexes
> acquired on demand.

>
> Signed-off-by: Krzysztof Małysa <var...@gmail.com>
> ---

> src/fs/value_shared_lock.hh | 65 +++++++++++++++++++++++++++++++++++++
> CMakeLists.txt | 1 +
> 2 files changed, 66 insertions(+)
> create mode 100644 src/fs/value_shared_lock.hh
>
> diff --git a/src/fs/value_shared_lock.hh b/src/fs/value_shared_lock.hh
> new file mode 100644
> index 00000000..6c7a3adf
> --- /dev/null
> +++ b/src/fs/value_shared_lock.hh
> @@ -0,0 +1,65 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*
> + * Copyright (C) 2020 ScyllaDB
> + */
> +

> +#pragma once
> +
> +#include "seastar/core/shared_mutex.hh"
> +
> +#include <map>

> +
> +namespace seastar::fs {
> +

> +template<class Value>
> +class value_shared_lock {
> + struct lock_info {
> + size_t users_num = 0;
> + shared_mutex lock;
> + };
> +
> + std::map<Value, lock_info> _locks;
> +

Why not unordered_map? Because of iterator invalidation?

> +public:
> + value_shared_lock() = default;
> +
> + template<class Func>
> + auto with_shared_on(const Value& val, Func&& func) {
> + auto it = _locks.emplace(val, lock_info {}).first;
> + ++it->second.users_num;
> + return with_shared(it->second.lock, std::forward<Func>(func)).finally([this, it] {
> + if (--it->second.users_num == 0) {
> + _locks.erase(it);
> + }
> + });
> + }
> +
> + template<class Func>
> + auto with_lock_on(const Value& val, Func&& func) {
> + auto it = _locks.emplace(val, lock_info {}).first;
> + ++it->second.users_num;
> + return with_lock(it->second.lock, std::forward<Func>(func)).finally([this, it] {
> + if (--it->second.users_num == 0) {
> + _locks.erase(it);
> + }
> + });
> + }

> +};
> +
> +} // namespace seastar::fs

> diff --git a/CMakeLists.txt b/CMakeLists.txt
> index fb8fe32c..8a59eca6 100644
> --- a/CMakeLists.txt
> +++ b/CMakeLists.txt
> @@ -673,6 +673,7 @@ if (Seastar_EXPERIMENTAL_FS)
> src/fs/path.hh
> src/fs/range.hh
> src/fs/units.hh
> + src/fs/value_shared_lock.hh
> )
> endif()
>
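On the std::map question above: the `finally` continuation keeps the iterator `it` alive while other values come and go. std::map iterators stay valid until their own element is erased, whereas std::unordered_map iterators may be invalidated when an insertion triggers a rehash. A minimal standard-library sketch of that lifetime follows; `keyed_refcount_pool` is a hypothetical stand-in that tracks only the user count and holds no real mutex.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical stand-in for value_shared_lock: it tracks only the user count,
// without any real mutex, to show the entry lifetime.
template <class Value>
class keyed_refcount_pool {
    struct entry {
        std::size_t users_num = 0;
    };
    std::map<Value, entry> _entries;

public:
    using handle = typename std::map<Value, entry>::iterator;

    // Creates the entry on first use and registers one more user.
    handle acquire(const Value& val) {
        auto it = _entries.try_emplace(val).first;
        ++it->second.users_num;
        return it;
    }

    // Drops one user; the entry is erased as soon as nobody uses it.
    void release(handle it) {
        assert(it->second.users_num > 0);
        if (--it->second.users_num == 0) {
            _entries.erase(it);
        }
    }

    std::size_t size() const noexcept { return _entries.size(); }
};

// Holds a handle across many unrelated insertions and erasures, which is what
// the continuation in with_shared_on() effectively does.
inline std::size_t demo_handle_survives_churn() {
    keyed_refcount_pool<int> pool;
    auto held = pool.acquire(42);
    for (int i = 0; i < 1000; ++i) {
        auto tmp = pool.acquire(i == 42 ? 43 : i);  // never touch key 42 again
        pool.release(tmp);
    }
    assert(held->first == 42);  // still points at the same element
    pool.release(held);
    return pool.size();         // every entry has been released by now
}
```

With an unordered container the same pattern would need to hold a pointer or reference to the element (those survive a rehash) and erase by key instead of by iterator.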

Piotr Sarna

<sarna@scylladb.com>

Apr 22, 2020, 6:55:44 AM

to Avi Kivity, Krzysztof Małysa, seastar-dev, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

Well, there were two original reasons.

Firstly, I imagined that in the future we could potentially bypass the file abstraction entirely - and instead of opening a block device as a seastar::file by its userspace path, we could have some support for using block devices directly from the reactor (e.g. with some block-device-specific optimizations).

Secondly, having a separate wrapper abstraction would express explicitly that it's not just a regular file, but a raw block device, so we could for example encapsulate all interesting metrics in one place - block device is somewhat special, so wrapping it in a more specific wrapper narrows its use-cases and provides more context. It's similar to how Scylla's materialized views have a view_ptr wrapper, which is pretty much only a schema_ptr, of which we *know* that it is a materialized view. block_device() is also quite a thin wrapper now, but it can potentially be more, should we want to provide some block-device-specific assertions or optimizations.

These are not super strong points and one of them depends on a potential support from Seastar reactor which does not exist yet, so I can be convinced to drop the block_device abstraction altogether, but nonetheless, these were my original reasons.

Benny Halevy

<bhalevy@scylladb.com>

Apr 22, 2020, 7:00:58 AM

to Krzysztof Małysa, seastar-dev@googlegroups.com, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

Hi Krzysztof,

Thanks for doing this!
One question regarding the high level interface to the filesystem.
A direction I had in mind for quite a while but never got to do it
is to define a virtual filesystem abstraction in seastar that covers both
metadata operations and I/O operations, so that we can present and easily use
multiple filesystem implementations like posix, a ram-based file system
for testing (with error injection), and now your filesystem.

Do you have any similar concept on your todo plans?

Benny

On Mon, 2020-04-20 at 14:01 +0200, Krzysztof Małysa wrote:
> github: https://github.com/psarna/seastar/commits/fs-metadata-log
>
> This series is part of the ZPP FS project that is coordinated by Piotr Sarna <sa...@scylladb.com>.
> The goal of this project is to create SeastarFS -- a fully asynchronous, sharded, user-space,
> log-structured file system that is intended to become an alternative to XFS for Scylla.
>

> 58 files changed, 9325 insertions(+)
> create mode 100644 include/seastar/fs/block_device.hh

> create mode 100644 include/seastar/fs/exceptions.hh
> create mode 100644 include/seastar/fs/file.hh
> create mode 100644 include/seastar/fs/overloaded.hh
> create mode 100644 include/seastar/fs/stat.hh
> create mode 100644 include/seastar/fs/temporary_file.hh
> create mode 100644 src/fs/bitwise.hh

> create mode 100644 src/fs/bootstrap_record.hh
> create mode 100644 src/fs/cluster.hh
> create mode 100644 src/fs/cluster_allocator.hh
> create mode 100644 src/fs/cluster_writer.hh
> create mode 100644 src/fs/crc.hh
> create mode 100644 src/fs/device_reader.hh
> create mode 100644 src/fs/inode.hh

> create mode 100644 src/fs/inode_info.hh
> create mode 100644 src/fs/metadata_disk_entries.hh
> create mode 100644 src/fs/metadata_log.hh
> create mode 100644 src/fs/metadata_log_bootstrap.hh
> create mode 100644 src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
> create mode 100644 src/fs/metadata_log_operations/create_file.hh
> create mode 100644 src/fs/metadata_log_operations/link_file.hh
> create mode 100644 src/fs/metadata_log_operations/read.hh
> create mode 100644 src/fs/metadata_log_operations/truncate.hh
> create mode 100644 src/fs/metadata_log_operations/unlink_or_remove_file.hh
> create mode 100644 src/fs/metadata_log_operations/write.hh
> create mode 100644 src/fs/metadata_to_disk_buffer.hh

> create mode 100644 src/fs/path.hh

Avi Kivity

<avi@scylladb.com>

Apr 22, 2020, 7:42:32 AM

to Krzysztof Małysa, seastar-dev@googlegroups.com, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

On 4/20/20 3:01 PM, Krzysztof Małysa wrote:

> SeastarFS is a log-structured filesystem. Every shard will have 3

> private logs:
> - metadata log

> - medium data log
> - big data log (this is not actually a log, but in the big picture it

> looks like it was)
>
> Disk space is divided into clusters (typically several MiB each) that
> all have the same size, which is a multiple of the alignment (typically
> 4096 bytes). Each shard has its private pool of clusters (assignment is
> stored in bootstrap record). Each log consumes clusters one by one -- it
> writes the current one and if cluster becomes full, then log switches to
> a new one that is obtained from a pool of free clusters managed by
> cluster_allocator. Metadata log and medium data log write data in the
> same manner: they fill up the cluster gradually from left to right. Big
> data log takes a cluster and completely fills it with data at once -- it
> is only used during big writes.
>

> This commit adds the skeleton of the metadata log:
> - data structures for holding metadata in memory with all operations on
> this data structure i.e. manipulating files and their contents
> - locking logic (detailed description can be found in metadata_log.hh)
> - buffers for writing logs to disk (one for metadata and one for medium
> data)
> - basic higher level interface e.g. path lookup, iterating over
> directory
> - bootstrapping metadata log == reading metadata log from disk and
> reconstructing shard's filesystem structure from just before shutdown
>
> File content is stored as a set of data vectors that may have one of
> three kinds: in memory data, on disk data, hole. Small writes are
> written directly to the metadata log and because all metadata is stored
> in the memory these writes are also in memory, therefore in-memory kind.
> Medium and large data are not stored in memory, so they are represented
> using on-disk kind. Enlarging file via truncate may produce holes, hence
> hole kind.
>
> Directory entries are stored as metadata log entries -- directory inodes
> have no content.
>
> To disk buffers buffer data that will be written to disk. There are two
> kinds: (normal) to disk buffer and metadata to disk buffer. The latter
> is implemented using the former, but provides higher level interface for
> appending metadata log entries rather than raw bytes.
>
> Normal to disk buffer appends data sequentially, but if a flush occurs
> the offset where next data will be appended is aligned up to alignment
> to ensure that writes to the same cluster are non-overlapping.
>
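The align-on-flush rule above can be sketched in a few lines. This is only an illustration under stated assumptions: `append_cursor` and its fields are hypothetical names, and the alignment is assumed to be a power of two (such as 4096).

```cpp
#include <cassert>
#include <cstdint>

// Rounds offset up to the nearest multiple of alignment (a power of two).
constexpr std::uint64_t align_up(std::uint64_t offset, std::uint64_t alignment) noexcept {
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Hypothetical model of the append position inside one cluster.
struct append_cursor {
    std::uint64_t alignment;
    std::uint64_t next_offset = 0;    // where the next append lands
    std::uint64_t flushed_up_to = 0;  // bytes already handed to the device

    void append(std::uint64_t len) noexcept { next_offset += len; }

    // After a flush the next append starts on a fresh aligned block, so a later
    // write never touches a block that an earlier write already covered.
    void flush() noexcept {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        next_offset = align_up(next_offset, alignment);
        flushed_up_to = next_offset;
    }
};
```

For example, appending 100 bytes and flushing moves the cursor to 4096; the padding between 100 and 4096 is the price paid for non-overlapping writes.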
> Metadata to disk buffer appends data using normal to disk buffer but
> does some formatting along the way. The structure of the metadata log on
> disk is as follows:
> | checkpoint_1 | entry_1, entry_2, ..., entry_n | checkpoint_2 | ... |
> | <---- checkpointed data -----> |
> etc. Every batch of metadata_log entries is preceded by a checkpoint
> entry. Appending metadata log appends the current batch of entries.
> Flushing or lack of space ends current batch of entries and then
> checkpoint entry is updated (because it holds CRC code of all
> checkpointed data) and then write of the whole batch is requested and a
> new checkpoint (if there is space for that) is started. Last checkpoint
> in a cluster contains a special entry pointing to the next cluster that
> is utilized by the metadata log.
>
> Bootstrapping is, in fact, just replaying all actions from metadata log
> that were saved on disk. It works as follows:
> - reads metadata log clusters one by one
> - for each cluster, until the last checkpoint contains pointer to the
> next cluster, processes the checkpoint and entries it checkpoints
> - processing works as follows:
> - checkpoint entry is read and if it is invalid it means that the
> metadata log ends here (last checkpoint was partially written or the
> metadata log really ended here or there was some data corruption...)
> and we stop
> - if it is correct, it contains the length of the checkpointed data
> (metadata log entries), so then we process all of them (error there
> indicates that there was data corruption but CRC is still somehow
> correct, so we abort all bootstrapping with an error)
>
> Locking is to ensure that concurrent modifications of the metadata do
> not corrupt it. E.g. Creating a file is a complex operation: you have
> to create inode and add a directory entry that will represent this inode
> with a path and write corresponding metadata log entries to the disk.
> Simultaneous attempts of creating the same file could corrupt the file
> system. Not to mention concurrent create and unlink on the same path...
> Thus careful and robust locking mechanism is used. For details see
> metadata_log.hh.
>
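The checkpoint layout and the replay loop described above can be sketched over an in-memory byte vector. Everything here is an assumption made for illustration: the header is laid out as | crc32 | length | in host byte order, a plain bitwise CRC-32 stands in for whatever `crc.hh` computes, the entries of a batch are an opaque string, and the next-cluster pointer entry is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Bitwise CRC-32 (reflected polynomial 0xEDB88320), used only as a stand-in checksum.
inline std::uint32_t crc32_update(std::uint32_t crc, const void* data, std::size_t len) noexcept {
    auto* p = static_cast<const unsigned char*>(data);
    crc = ~crc;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Checksum over | data | checkpointed_data_length |, in that order.
inline std::uint32_t checkpoint_crc(const std::string& data, std::uint32_t length) noexcept {
    const std::uint32_t crc = crc32_update(0, data.data(), data.size());
    return crc32_update(crc, &length, sizeof(length));
}

// Appends one batch to the log: | crc32 | length | data |.
inline void append_batch(std::vector<unsigned char>& log, const std::string& data) {
    assert(data.size() <= UINT32_MAX);
    const auto length = static_cast<std::uint32_t>(data.size());
    const std::uint32_t crc = checkpoint_crc(data, length);
    unsigned char header[8];
    std::memcpy(header, &crc, 4);
    std::memcpy(header + 4, &length, 4);
    log.insert(log.end(), header, header + 8);
    log.insert(log.end(), data.begin(), data.end());
}

// Replays batches in order and stops at the first checkpoint that does not validate.
inline std::vector<std::string> replay(const std::vector<unsigned char>& log) {
    std::vector<std::string> batches;
    std::size_t pos = 0;
    while (log.size() - pos >= 8) {
        std::uint32_t crc, length;
        std::memcpy(&crc, log.data() + pos, 4);
        std::memcpy(&length, log.data() + pos + 4, 4);
        if (length > log.size() - pos - 8) {
            break;  // batch was only partially written
        }
        std::string data(reinterpret_cast<const char*>(log.data()) + pos + 8, length);
        if (checkpoint_crc(data, length) != crc) {
            break;  // torn write, stale bytes or corruption: the log ends here
        }
        batches.push_back(std::move(data));
        pos += 8 + length;
    }
    return batches;
}
```

A batch whose checksum fails simply terminates the replay, which matches the "metadata log ends here" case; distinguishing a torn tail from real corruption would need more information than this sketch keeps.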

> diff --git a/src/fs/inode_info.hh b/src/fs/inode_info.hh
> new file mode 100644
> index 00000000..89bc71d8
> --- /dev/null
> +++ b/src/fs/inode_info.hh

> +
> +namespace seastar::fs {
> +

> +struct inode_data_vec {
> + file_range data_range; // data spans [beg, end) range of the file
> +
> + struct in_mem_data {

> + temporary_buffer<uint8_t> data;
> + };
> +

> + struct on_disk_data {
> + file_offset_t device_offset;
> + };

No separate representation for medium data? The offsets can change while
rewriting the log.

> +
> + struct hole_data { };
> +
> + std::variant<in_mem_data, on_disk_data, hole_data> data_location;
> +
> + // TODO: rename that function to something more suitable
> + inode_data_vec share_copy() {

Maybe "shallow_copy"?

> + inode_data_vec shared;
> + shared.data_range = data_range;
> + std::visit(overloaded {
> + [&](inode_data_vec::in_mem_data& mem) {
> + shared.data_location = inode_data_vec::in_mem_data {mem.data.share()};

Please don't add these spaces in the middle of constructors, it's
distracting.

> + },
> + [&](inode_data_vec::on_disk_data& disk_data) {
> + shared.data_location = disk_data;
> + },
> + [&](inode_data_vec::hole_data&) {
> + shared.data_location = inode_data_vec::hole_data {};
> + },
> + }, data_location);
> + return shared;
> + }
> +};
> +
> +struct inode_info {
> + uint32_t opened_files_count = 0; // Number of open files referencing inode
> + uint32_t directories_containing_file = 0;

Suppose one directory has two names for the same inode. Does it count as 1
or 2 here?

> + unix_metadata metadata;
> +
> + struct directory {
> + // TODO: directory entry cannot contain '/' character --> add checks for that
> + std::map<std::string, inode_t, std::less<>> entries; // entry name => inode

Why std::map and not unordered_map? Resumable iteration? Please add
comments.

> + };
> +
> + struct file {
> + std::map<file_offset_t, inode_data_vec> data; // file offset => data vector that begins there (data vectors
> + // do not overlap)
> +

This can be optimized, but no need to touch it now. Probably some kind
of radix tree.

> + file_offset_t size() const noexcept {
> + return (data.empty() ? 0 : data.rbegin()->second.data_range.end);
> + }
> +
> + // Deletes data vectors that are subset of @p data_range and cuts overlapping data vectors to make them
> + // not overlap. @p cut_data_vec_processor is called on each inode_data_vec (including parts of overlapped
> + // data vectors) that will be deleted
> + template<class Func>

Please add a constraint for the signature of Func, both to document it
and to catch errors earlier. I see you have a static assert, but
constraints are better since they apply to the declaration, not the body.

Consider also changing it from a template parameter to a noncopyable
function. These are all heavyweight operations so saving every cycle
isn't necessary.
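To illustrate the two options, here is a minimal C++17 sketch. `data_vec` is a hypothetical stand-in for `inode_data_vec`, and `std::function` stands in for `seastar::noncopyable_function` only so that the example compiles without Seastar.

```cpp
#include <cassert>
#include <functional>
#include <type_traits>
#include <utility>
#include <vector>

struct data_vec { int beg; int end; };  // hypothetical stand-in for inode_data_vec

// Option 1: constrain the declaration. A mismatched callable removes this
// template from overload resolution, so the error is reported at the call
// site instead of somewhere inside the body.
template <class Func, std::enable_if_t<std::is_invocable_v<Func, data_vec>, int> = 0>
void for_each_vec(const std::vector<data_vec>& vecs, Func&& func) {
    for (const auto& v : vecs) {
        func(v);
    }
}

// Option 2: type-erase the callback, trading a little call overhead for a
// non-template function with an explicit signature.
inline void for_each_vec_erased(const std::vector<data_vec>& vecs,
                                const std::function<void(data_vec)>& func) {
    assert(static_cast<bool>(func));  // an empty std::function would throw on call
    for (const auto& v : vecs) {
        func(v);
    }
}

// Detects whether for_each_vec(vecs, F) is a viable call.
template <class F, class = void>
struct accepts : std::false_type {};
template <class F>
struct accepts<F, std::void_t<decltype(for_each_vec(
        std::declval<const std::vector<data_vec>&>(), std::declval<F>()))>>
    : std::true_type {};

// Usage of option 1: sums the lengths of all vectors.
inline int total_length(const std::vector<data_vec>& vecs) {
    int sum = 0;
    for_each_vec(vecs, [&](data_vec v) { sum += v.end - v.beg; });
    return sum;
}
```

The `accepts` trait shows the practical difference: with the constraint in the declaration, a wrong callable is detectable without instantiating the body, which a `static_assert` inside the function cannot offer.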

> + void cut_out_data_range(file_range range, Func&& cut_data_vec_processor) {
> + static_assert(std::is_invocable_v<Func, inode_data_vec>);
> + // Cut all vectors intersecting with range
> + auto it = data.lower_bound(range.beg);
> + if (it != data.begin() and are_intersecting(range, prev(it)->second.data_range)) {
> + --it;
> + }
> +
> + while (it != data.end() and are_intersecting(range, it->second.data_range)) {
> + auto data_vec = std::move(data.extract(it++).mapped());

I presume this removes *it from the map?

> + const auto cap = intersection(range, data_vec.data_range);
> + if (cap == data_vec.data_range) {
> + // Fully intersects => remove it
> + cut_data_vec_processor(std::move(data_vec));

If cut_data_vec_processor fails, we have to reinstate *it.

> + continue;
> + }
> +
> + // Overlaps => cut it, possibly into two parts:
> + // | data_vec |
> + // | cap |
> + // | left | mid | right |
> + // left and right remain, but mid is deleted
> + inode_data_vec left, mid, right;
> + left.data_range = {data_vec.data_range.beg, cap.beg};
> + mid.data_range = cap;
> + right.data_range = {cap.end, data_vec.data_range.end};
> + auto right_beg_shift = right.data_range.beg - data_vec.data_range.beg;
> + auto mid_beg_shift = mid.data_range.beg - data_vec.data_range.beg;
> + std::visit(overloaded {
> + [&](inode_data_vec::in_mem_data& mem) {
> + left.data_location = inode_data_vec::in_mem_data {mem.data.share(0, left.data_range.size())};
> + mid.data_location = inode_data_vec::in_mem_data {
> + mem.data.share(mid_beg_shift, mid.data_range.size())
> + };
> + right.data_location = inode_data_vec::in_mem_data {
> + mem.data.share(right_beg_shift, right.data_range.size())
> + };
> + },
> + [&](inode_data_vec::on_disk_data& disk_data) {
> + left.data_location = disk_data;
> + mid.data_location = inode_data_vec::on_disk_data {disk_data.device_offset + mid_beg_shift};
> + right.data_location = inode_data_vec::on_disk_data {disk_data.device_offset + right_beg_shift};
> + },
> + [&](inode_data_vec::hole_data&) {
> + left.data_location = inode_data_vec::hole_data {};
> + mid.data_location = inode_data_vec::hole_data {};
> + right.data_location = inode_data_vec::hole_data {};
> + },
> + }, data_vec.data_location);
> +
> + // Save new data vectors
> + if (not left.data_range.is_empty()) {
> + data.emplace(left.data_range.beg, std::move(left));
> + }
> + if (not right.data_range.is_empty()) {
> + data.emplace(right.data_range.beg, std::move(right));
> + }

We need to undo this if cut_data_vec_processor fails. It also means that
cut_data_vec_processor has to commit to strong exception guarantees - if
it throws, it must undo everything it has done for this element.

> +
> + // Process deleted vector
> + cut_data_vec_processor(std::move(mid));
> + }
> + }

Very nice.
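For readers following the left / mid / right split, a stripped-down sketch on plain ranges may help. It is not the patch's code: `range`, `range_map` and `cut_out` are hypothetical, the data location variant is dropped, and the removed pieces are returned instead of being passed to a callback. Collecting them first is one possible way to keep a throwing processor out of the mutation loop, mentioned here only as an option.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <vector>

struct range {
    std::uint64_t beg, end;  // half-open [beg, end)
};

// offset => range that begins there; ranges never overlap.
using range_map = std::map<std::uint64_t, range>;

// Removes [cut.beg, cut.end) from the map and returns the removed pieces in
// ascending order. Ranges that only partly overlap are trimmed, not dropped.
inline std::vector<range> cut_out(range_map& data, range cut) {
    assert(cut.beg < cut.end);
    std::vector<range> removed;
    auto it = data.lower_bound(cut.beg);
    // The predecessor may start before cut.beg and still reach into the cut.
    if (it != data.begin() && std::prev(it)->second.end > cut.beg) {
        --it;
    }
    while (it != data.end() && it->second.beg < cut.end) {
        const range old = it->second;
        it = data.erase(it);  // returns the iterator to the next element
        const range mid{std::max(old.beg, cut.beg), std::min(old.end, cut.end)};
        if (old.beg < mid.beg) {
            data.emplace(old.beg, range{old.beg, mid.beg});  // left remainder
        }
        if (mid.end < old.end) {
            data.emplace(mid.end, range{mid.end, old.end});  // right remainder
        }
        removed.push_back(mid);
    }
    return removed;
}
```

Cutting [5, 25) out of the ranges [0, 10), [10, 20), [20, 30) leaves [0, 5) and [25, 30) in the map and returns [5, 10), [10, 20), [20, 25).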

> +
> + // Executes @p execute_on_data_ranges_processor on each data vector that is a subset of @p data_range.
> + // Data vectors on the edges are appropriately trimmed before passed to the function.
> + template<class Func>

Again, please add a constraint, and consider using a noncopyable_function
instead.

> + void execute_on_data_range(file_range range, Func&& execute_on_data_range_processor) {
> + static_assert(std::is_invocable_v<Func, inode_data_vec>);
> + auto it = data.lower_bound(range.beg);
> + if (it != data.begin() and are_intersecting(range, prev(it)->second.data_range)) {
> + --it;
> + }
> +
> + while (it != data.end() and are_intersecting(range, it->second.data_range)) {
> + auto& data_vec = (it++)->second;
> + const auto cap = intersection(range, data_vec.data_range);
> + if (cap == data_vec.data_range) {
> + // Fully intersects => execute
> + execute_on_data_range_processor(data_vec.share_copy());
> + continue;
> + }
> +
> + inode_data_vec mid;
> + mid.data_range = std::move(cap);
> + auto mid_beg_shift = mid.data_range.beg - data_vec.data_range.beg;
> + std::visit(overloaded {
> + [&](inode_data_vec::in_mem_data& mem) {
> + mid.data_location = inode_data_vec::in_mem_data {
> + mem.data.share(mid_beg_shift, mid.data_range.size())
> + };
> + },
> + [&](inode_data_vec::on_disk_data& disk_data) {
> + mid.data_location = inode_data_vec::on_disk_data {disk_data.device_offset + mid_beg_shift};
> + },
> + [&](inode_data_vec::hole_data&) {
> + mid.data_location = inode_data_vec::hole_data {};
> + },
> + }, data_vec.data_location);
> +
> + // Execute on middle range
> + execute_on_data_range_processor(std::move(mid));
> + }
> + }
> + };
> +
> + std::variant<directory, file> contents;
> +

Please move to top-of-class and make private.

> + bool is_linked() const noexcept {
> + return directories_containing_file != 0;
> + }
> +
> + bool is_open() const noexcept {
> + return opened_files_count != 0;
> + }
> +
> + constexpr bool is_directory() const noexcept { return std::holds_alternative<directory>(contents); }
> +
> + // These are noexcept because invalid access is a bug not an error
> + constexpr directory& get_directory() & noexcept { return std::get<directory>(contents); }
> + constexpr const directory& get_directory() const & noexcept { return std::get<directory>(contents); }
> + constexpr directory&& get_directory() && noexcept { return std::move(std::get<directory>(contents)); }
> +
> + constexpr bool is_file() const noexcept { return std::holds_alternative<file>(contents); }
> +
> + // These are noexcept because invalid access is a bug not an error
> + constexpr file& get_file() & noexcept { return std::get<file>(contents); }
> + constexpr const file& get_file() const & noexcept { return std::get<file>(contents); }
> + constexpr file&& get_file() && noexcept { return std::move(std::get<file>(contents)); }

> +};
> +
> +} // namespace seastar::fs

> diff --git a/src/fs/metadata_disk_entries.hh b/src/fs/metadata_disk_entries.hh
> new file mode 100644
> index 00000000..44c2a1c7
> --- /dev/null
> +++ b/src/fs/metadata_disk_entries.hh
> @@ -0,0 +1,63 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*
> + * Copyright (C) 2019 ScyllaDB

> + */
> +
> +#pragma once
> +

> +#include "fs/cluster.hh"
> +#include "fs/inode.hh"
> +#include "fs/unix_metadata.hh"

> +
> +namespace seastar::fs {
> +

> +enum ondisk_type : uint8_t {
> + INVALID = 0,
> + CHECKPOINT,
> + NEXT_METADATA_CLUSTER,
> +};
> +
> +struct ondisk_checkpoint {
> + // The disk format is as follows:
> + // | ondisk_checkpoint | .............................. |
> + // | data |
> + // |<-- checkpointed_data_length -->|
> + // ^
> + // ______________________________________________/
> + // /
> + // there ends checkpointed data and (next checkpoint begins or metadata in the current cluster end)
> + //
> + // CRC is calculated from byte sequence | data | checkpointed_data_length |
> + // E.g. if the data consist of bytes "abcd" and checkpointed_data_length of bytes "xyz" then the byte sequence
> + // would be "abcdxyz"
> + uint32_t crc32_code;
> + unit_size_t checkpointed_data_length;
> +} __attribute__((packed));

You can use the nicer [[gnu::packed]].

Let's also add a uuid unique to the filesystem instance, so we don't
pick up a valid log entry after we reformat a filesystem (if we do,
we'll reject it because the uuids don't match).
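A rough sketch of both suggestions, assuming GCC or Clang. The field widths are guesses (`unit_size_t` is taken to be 32 bits), and the 16-byte `fs_uuid` field and the `belongs_to` helper are hypothetical names, not part of the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Checkpoint header extended with a per-filesystem identifier.
struct [[gnu::packed]] checkpoint_header {
    std::uint32_t crc32_code;
    std::uint32_t checkpointed_data_length;  // assumed 32-bit unit_size_t
    std::uint8_t fs_uuid[16];                // hypothetical 128-bit instance id
};
static_assert(sizeof(checkpoint_header) == 24);

// A one-byte tag followed by a 64-bit id shows what packing changes:
// without the attribute the id would be placed at offset 8.
struct [[gnu::packed]] next_cluster_entry {
    std::uint8_t type;
    std::uint64_t cluster_id;
};
static_assert(sizeof(next_cluster_entry) == 9, "packed: no padding after the tag");

// Accepts a checkpoint header only if it was written by this filesystem instance.
inline bool belongs_to(const unsigned char* bytes, std::size_t len,
                       const std::uint8_t (&expected_uuid)[16]) {
    assert(bytes != nullptr);
    if (len < sizeof(checkpoint_header)) {
        return false;
    }
    checkpoint_header hdr;
    std::memcpy(&hdr, bytes, sizeof(hdr));  // copy instead of casting the buffer pointer
    return std::memcmp(hdr.fs_uuid, expected_uuid, sizeof(hdr.fs_uuid)) == 0;
}
```

After a reformat generates a new uuid, a leftover checkpoint from the previous instance can still carry a valid CRC, yet `belongs_to` rejects it because the identifiers differ.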

> +
> +struct ondisk_next_metadata_cluster {
> + cluster_id_t cluster_id; // metadata log continues there
> +} __attribute__((packed));
> +
> +template<typename T>
> +constexpr size_t ondisk_entry_size(const T& entry) noexcept {
> + static_assert(std::is_same_v<T, ondisk_next_metadata_cluster>, "ondisk entry size not defined for given type");
> + return sizeof(ondisk_type) + sizeof(entry);

> +}
> +
> +} // namespace seastar::fs

> diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
> new file mode 100644
> index 00000000..c10852a3
> --- /dev/null
> +++ b/src/fs/metadata_log.hh
> @@ -0,0 +1,295 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*

> + */
> +
> +#pragma once
> +

> +#include "fs/cluster.hh"
> +#include "fs/cluster_allocator.hh"
> +#include "fs/inode.hh"
> +#include "fs/inode_info.hh"
> +#include "fs/metadata_disk_entries.hh"
> +#include "fs/metadata_to_disk_buffer.hh"
> +#include "fs/units.hh"
> +#include "fs/unix_metadata.hh"
> +#include "fs/value_shared_lock.hh"
> +#include "seastar/core/file-types.hh"
> +#include "seastar/core/future-util.hh"
> +#include "seastar/core/future.hh"
> +#include "seastar/core/shared_future.hh"
> +#include "seastar/core/shared_ptr.hh"
> +#include "seastar/core/temporary_buffer.hh"
> +#include "seastar/fs/exceptions.hh"
> +
> +#include <chrono>
> +#include <cstddef>
> +#include <exception>
> +#include <type_traits>
> +#include <utility>
> +#include <variant>

> +
> +namespace seastar::fs {
> +

> +class metadata_log {
> + block_device _device;
> + const unit_size_t _cluster_size;
> + const unit_size_t _alignment;
> +
> + // Takes care of writing current cluster of serialized metadata log entries to device
> + shared_ptr<metadata_to_disk_buffer> _curr_cluster_buff;
> + shared_future<> _background_futures = now();
> +
> + // In memory metadata
> + cluster_allocator _cluster_allocator;
> + std::map<inode_t, inode_info> _inodes;
> +

unordered_map?

> + inode_t _root_dir;
> + shard_inode_allocator _inode_allocator;
> +
> + // Locks are used to ensure metadata consistency while allowing concurrent usage.
> + //
> + // Whenever one wants to create or delete inode or directory entry, one has to acquire appropriate unique lock for
> + // the inode / dir entry that will appear / disappear and only after locking that operation should take place.
> + // Shared locks should be used only to ensure that an inode / dir entry won't disappear / appear, while some action
> + // is performed. Therefore, unique locks ensure that resource is not used by anyone else.
> + //
> + // IMPORTANT: if an operation needs to acquire more than one lock, it has to be done with *one* call to
> + // locks::with_locks() because it is ensured there that a deadlock-free locking order is used (for details see
> + // that function).
> + //
> + // Examples:
> + // - To create file we have to take shared lock (SL) on the directory to which we add a dir entry and
> + // unique lock (UL) on the added entry in this directory. SL is taken because the directory should not disappear.
> + // UL is taken, because we do not want the entry to appear while we are creating it.
> + // - To read or write to a file, a SL is acquired on its inode and then the operation is performed.
> + class locks {
> + value_shared_lock<inode_t> _inode_locks;
> + value_shared_lock<std::pair<inode_t, std::string>> _dir_entry_locks;
> +
> + public:
> + struct shared {
> + inode_t inode;
> + std::optional<std::string> dir_entry;
> + };
> +
> + template<class T>
> + static constexpr bool is_shared = std::is_same_v<std::remove_cv_t<std::remove_reference_t<T>>, shared>;
> +
> + struct unique {
> + inode_t inode;
> + std::optional<std::string> dir_entry;
> + };
> +
> + template<class T>
> + static constexpr bool is_unique = std::is_same_v<std::remove_cv_t<std::remove_reference_t<T>>, unique>;
> +
> + template<class Kind, class Func>
> + auto with_lock(Kind kind, Func&& func) {
> + static_assert(is_shared<Kind> or is_unique<Kind>);
> + if constexpr (is_shared<Kind>) {
> + if (kind.dir_entry.has_value()) {
> + return _dir_entry_locks.with_shared_on({kind.inode, std::move(*kind.dir_entry)},
> + std::forward<Func>(func));
> + } else {
> + return _inode_locks.with_shared_on(kind.inode, std::forward<Func>(func));
> + }
> + } else {
> + if (kind.dir_entry.has_value()) {
> + return _dir_entry_locks.with_lock_on({kind.inode, std::move(*kind.dir_entry)},
> + std::forward<Func>(func));
> + } else {
> + return _inode_locks.with_lock_on(kind.inode, std::forward<Func>(func));
> + }
> + }
> + }
> +
> + private:
> + template<class Kind1, class Kind2, class Func>
> + auto with_locks_in_order(Kind1 kind1, Kind2 kind2, Func func) {
> + // Func is not a universal reference because we will have to store it
> + return with_lock(std::move(kind1), [this, kind2 = std::move(kind2), func = std::move(func)] () mutable {
> + return with_lock(std::move(kind2), std::move(func));
> + });
> + };
> +
> + public:
> +
> + template<class Kind1, class Kind2, class Func>
> + auto with_locks(Kind1 kind1, Kind2 kind2, Func&& func) {
> + static_assert(is_shared<Kind1> or is_unique<Kind1>);
> + static_assert(is_shared<Kind2> or is_unique<Kind2>);
> +
> + // Locking order is as follows: kind with lower tuple (inode, dir_entry) goes first.
> + // This order is linear and we always lock in one direction, so the graph of locking relations (A -> B iff
> + // lock on A is acquired and lock on B is acquired / being acquired) makes a DAG. Thus, deadlock is
> + // impossible, as it would require a cycle to appear.
> + std::pair<inode_t, std::optional<std::string>&> k1 {kind1.inode, kind1.dir_entry};
> + std::pair<inode_t, std::optional<std::string>&> k2 {kind2.inode, kind2.dir_entry};
> + if (k1 < k2) {
> + return with_locks_in_order(std::move(kind1), std::move(kind2), std::forward<Func>(func));
> + } else {
> + return with_locks_in_order(std::move(kind2), std::move(kind1), std::forward<Func>(func));
> + }

You can std::swap(kind1, kind2) instead of duplicating the call. This
will reduce the compiler's motivation to inline two calls to func,
bloating the code.

> + }
> + } _locks;

Very nice, this locking system.
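The ordering rule is easy to check in isolation. In the sketch below `lock_key`, `locks_before` and `acquisition_order` are hypothetical names and `inode_t` is assumed to be a 64-bit integer. Both requests share one type here, which is what makes a plain `std::swap` possible; in the patch the two kinds may be different types (`shared` and `unique`), so a literal swap would first need a common representation.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <tuple>
#include <utility>

using inode_t = std::uint64_t;  // assumed width

// One lock request: an inode, or a directory entry inside that inode.
struct lock_key {
    inode_t inode;
    std::optional<std::string> dir_entry;
};

// Strict ordering on (inode, dir_entry); an empty dir_entry sorts before any name.
inline bool locks_before(const lock_key& a, const lock_key& b) {
    return std::tie(a.inode, a.dir_entry) < std::tie(b.inode, b.dir_entry);
}

// Returns the two keys in the order in which they would be acquired.
inline std::pair<lock_key, lock_key> acquisition_order(lock_key a, lock_key b) {
    if (locks_before(b, a)) {
        std::swap(a, b);  // one code path afterwards instead of two mirrored calls
    }
    assert(!locks_before(b, a));
    return {std::move(a), std::move(b)};
}
```

Because every caller acquires in this one global order, two operations that need the same pair of locks can never wait on each other in a cycle.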

> +
> + // TODO: for compaction: keep some set(?) of inode_data_vec, so that we can keep track of clusters that have lowest
> + // utilization (up-to-date data)
> + // TODO: for compaction: keep estimated metadata log size (that would take when written to disk) and
> + // the real size of metadata log taken on disk to allow for detecting when compaction
> +
> + friend class metadata_log_bootstrap;
> +
> +public:
> + metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment,
> + shared_ptr<metadata_to_disk_buffer> cluster_buff);
> +
> + metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment);
> +
> + metadata_log(const metadata_log&) = delete;
> + metadata_log& operator=(const metadata_log&) = delete;
> + metadata_log(metadata_log&&) = default;
> +
> + future<> bootstrap(inode_t root_dir, cluster_id_t first_metadata_cluster_id, cluster_range available_clusters,
> + fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id);
> +
> + future<> shutdown();
> +
> +private:
> + bool inode_exists(inode_t inode) const noexcept {
> + return _inodes.count(inode) != 0;

> + }
> +
> + template<class Func>

> + void schedule_background_task(Func&& task) {
> + _background_futures = when_all_succeed(_background_futures.get_future(), std::forward<Func>(task));
> + }

Please add a TODO to limit the amount of background work. This code
allows it to grow without bounds, so if we're not able to retire this
work quickly enough we can go OOM.

> +
> + void schedule_flush_of_curr_cluster();
> +
> + enum class flush_result {
> + DONE,
> + NO_SPACE
> + };
> +
> + [[nodiscard]] flush_result schedule_flush_of_curr_cluster_and_change_it_to_new_one();
> +
> + future<> flush_curr_cluster();
> +
> + enum class append_result {
> + APPENDED,
> + TOO_BIG,
> + NO_SPACE
> + };
> +
> + template<class... Args>
> + [[nodiscard]] append_result append_ondisk_entry(Args&&... args) {
> + using AR = append_result;
> + // TODO: maybe check for errors on _background_futures to expose previous errors?
> + switch (_curr_cluster_buff->append(args...)) {
> + case metadata_to_disk_buffer::APPENDED:
> + return AR::APPENDED;
> + case metadata_to_disk_buffer::TOO_BIG:
> + break;
> + }
> +
> + switch (schedule_flush_of_curr_cluster_and_change_it_to_new_one()) {
> + case flush_result::NO_SPACE:
> + return AR::NO_SPACE;
> + case flush_result::DONE:
> + break;
> + }
> +
> + switch (_curr_cluster_buff->append(args...)) {
> + case metadata_to_disk_buffer::APPENDED:
> + return AR::APPENDED;
> + case metadata_to_disk_buffer::TOO_BIG:
> + return AR::TOO_BIG;
> + }
> +
> + __builtin_unreachable();
> + }
> +
> + enum class path_lookup_error {
> + NOT_ABSOLUTE, // a path is not absolute
> + NO_ENTRY, // no such file or directory
> + NOT_DIR, // a component used as a directory in path is not, in fact, a directory
> + };
> +
> + std::variant<inode_t, path_lookup_error> do_path_lookup(const std::string& path) const noexcept;
> +
> + // It is safe for @p path to be a temporary (there is no need to worry about its lifetime)
> + future<inode_t> path_lookup(const std::string& path) const;
> +
> +public:
> + template<class Func>

As usual, add a constraint, and consider noncopyable_function.

I see you support two signatures, so you may have to split into two
functions.

> + future<> iterate_directory(const std::string& dir_path, Func func) {
> + static_assert(std::is_invocable_r_v<future<>, Func, const std::string&> or
> + std::is_invocable_r_v<future<stop_iteration>, Func, const std::string&>);
> + auto convert_func = [&]() -> decltype(auto) {

> + if constexpr (std::is_invocable_r_v<future<stop_iteration>, Func, const std::string&>) {
> + return std::move(func);
> + } else {
> + return [func = std::move(func)]() -> future<stop_iteration> {
> + return func().then([] {

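Where's the string parameter? The wrapper lambda takes no arguments and
calls func() without one, although the static_assert above requires
func to accept a const std::string&.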
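
A sketch of an adapter that does forward the entry name. Futures are
left out to keep it standalone, stop_iteration is a local stand-in for
seastar::stop_iteration, and adapt_entry_callback is a hypothetical
name:

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// Simplified stand-in for seastar::stop_iteration.
enum class stop_iteration { no, yes };

// Wraps a callback so that it always takes the entry name and returns stop_iteration.
template <class Func>
auto adapt_entry_callback(Func func) {
    if constexpr (std::is_invocable_r_v<stop_iteration, Func&, const std::string&>) {
        // Already has the richer signature, pass it through unchanged.
        return func;
    } else {
        static_assert(std::is_invocable_v<Func&, const std::string&>,
                "callback must accept the entry name");
        // The wrapper accepts the entry name and forwards it to func.
        return [func = std::move(func)](const std::string& entry) mutable {
            func(entry);
            return stop_iteration::no;
        };
    }
}
```

Splitting iterate_directory() into two overloads would remove the need
for the if constexpr dispatch altogether.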

> + return stop_iteration::no;
> + });
> + };
> + }
> + };
> + return path_lookup(dir_path).then([this, func = convert_func()](inode_t dir_inode) {
> + return do_with(std::move(func), std::string {}, [this, dir_inode](auto& func, auto& prev_entry) {
> + auto it = _inodes.find(dir_inode);

> + if (it == _inodes.end()) {

> + return now(); // Directory disappeared
> + }
> + if (not it->second.is_directory()) {
> + return make_exception_future(path_component_not_directory_exception());
> + }
> +
> + return repeat([this, dir_inode, &prev_entry, &func] {
> + auto it = _inodes.find(dir_inode);
> + if (it == _inodes.end()) {
> + return make_ready_future<stop_iteration>(stop_iteration::yes); // Directory disappeared
> + }
> + assert(it->second.is_directory() and "Directory cannot become a file");
> + auto& dir = it->second.get_directory();
> +
> + auto entry_it = dir.entries.upper_bound(prev_entry);
> + if (entry_it == dir.entries.end()) {
> + return make_ready_future<stop_iteration>(stop_iteration::yes); // No more entries
> + }
> +
> + prev_entry = entry_it->first;
> + return func(static_cast<const std::string&>(prev_entry));

Why the cast?

> + });
> + });
> + });
> + }
> +

> + // Returns size of the file or throws exception iff @p inode is invalid

> + file_offset_t file_size(inode_t inode) const;
> +
> + // All disk-related errors will be exposed here
> + future<> flush_log() {
> + return flush_curr_cluster();

> + }
> +};
> +
> +} // namespace seastar::fs

> diff --git a/src/fs/metadata_log_bootstrap.hh b/src/fs/metadata_log_bootstrap.hh
> new file mode 100644
> index 00000000..5da79631
> --- /dev/null
> +++ b/src/fs/metadata_log_bootstrap.hh
> @@ -0,0 +1,123 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*

> + */
> +
> +#pragma once
> +

> +#include "fs/bitwise.hh"
> +#include "fs/cluster.hh"
> +#include "fs/inode.hh"
> +#include "fs/inode_info.hh"
> +#include "fs/metadata_disk_entries.hh"
> +#include "fs/metadata_to_disk_buffer.hh"
> +#include "fs/units.hh"
> +#include "fs/metadata_log.hh"
> +#include "seastar/core/do_with.hh"
> +#include "seastar/core/future-util.hh"
> +#include "seastar/core/future.hh"
> +#include "seastar/core/temporary_buffer.hh"
> +
> +#include <boost/crc.hpp>
> +#include <cstddef>
> +#include <cstring>
> +#include <unordered_set>
> +#include <variant>

> +
> +namespace seastar::fs {
> +

> +// TODO: add a comment about what it is
> +class data_reader {
> + const uint8_t* _data = nullptr;
> + size_t _size = 0;
> + size_t _pos = 0;
> + size_t _last_checkpointed_pos = 0;
> +
> +public:
> + data_reader() = default;
> +
> + data_reader(const uint8_t* data, size_t size) : _data(data), _size(size) {}
> +
> + size_t curr_pos() const noexcept { return _pos; }
> +
> + size_t last_checkpointed_pos() const noexcept { return _last_checkpointed_pos; }
> +
> + size_t bytes_left() const noexcept { return _size - _pos; }
> +
> + void align_curr_pos(size_t alignment) noexcept { _pos = round_up_to_multiple_of_power_of_2(_pos, alignment); }
> +
> + void checkpoint_curr_pos() noexcept { _last_checkpointed_pos = _pos; }
> +
> + // Returns whether the reading was successful
> + bool read(void* destination, size_t size);
> +
> + // Returns whether the reading was successful
> + template<class T>
> + bool read_entry(T& entry) noexcept {
> + return read(&entry, sizeof(entry));
> + }
> +
> + // Returns whether the reading was successful
> + bool read_string(std::string& str, size_t size);
> +
> + std::optional<temporary_buffer<uint8_t>> read_tmp_buff(size_t size);
> +
> + // Returns whether the processing was successful
> + bool process_crc_without_reading(boost::crc_32_type& crc, size_t size);
> +
> + std::optional<data_reader> extract(size_t size);
> +};
> +
> +class metadata_log_bootstrap {
> + metadata_log& _metadata_log;
> + cluster_range _available_clusters;
> + std::unordered_set<cluster_id_t> _taken_clusters;
> + std::optional<cluster_id_t> _next_cluster;
> + temporary_buffer<uint8_t> _curr_cluster_data;
> + data_reader _curr_cluster;
> + data_reader _curr_checkpoint;
> +
> + metadata_log_bootstrap(metadata_log& metadata_log, cluster_range available_clusters);
> +
> + future<> bootstrap(cluster_id_t first_metadata_cluster_id, fs_shard_id_t fs_shards_pool_size,
> + fs_shard_id_t fs_shard_id);
> +
> + future<> bootstrap_cluster(cluster_id_t curr_cluster);
> +
> + static auto invalid_entry_exception() {
> + return make_exception_future<>(std::runtime_error("Invalid metadata log entry"));
> + }
> +
> + future<> bootstrap_read_cluster();
> +
> + // Returns whether reading and checking was successful
> + bool read_and_check_checkpoint();
> +
> + future<> bootstrap_checkpointed_data();
> +
> + future<> bootstrap_next_metadata_cluster();
> +
> + bool inode_exists(inode_t inode);
> +
> +public:
> + static future<> bootstrap(metadata_log& metadata_log, inode_t root_dir, cluster_id_t first_metadata_cluster_id,
> + cluster_range available_clusters, fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id);

> +};
> +
> +} // namespace seastar::fs

> diff --git a/src/fs/metadata_to_disk_buffer.hh b/src/fs/metadata_to_disk_buffer.hh
> new file mode 100644
> index 00000000..bd60f4f3
> --- /dev/null
> +++ b/src/fs/metadata_to_disk_buffer.hh
> @@ -0,0 +1,158 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*
> + * Copyright (C) 2019 ScyllaDB

> + */
> +
> +#pragma once
> +

> +#include "fs/bitwise.hh"
> +#include "fs/metadata_disk_entries.hh"
> +#include "fs/to_disk_buffer.hh"
> +
> +#include <boost/crc.hpp>

> +
> +namespace seastar::fs {
> +

> +// Represents buffer that will be written to a block_device. Method init() should be called just after constructor
> +// in order to finish construction.
> +class metadata_to_disk_buffer : protected to_disk_buffer {
> + boost::crc_32_type _crc;

Please use crc32c which has hardware implementations. Later, we can
contribute the optimized implementation from Scylla.
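
For reference, a bit-by-bit software CRC-32C as a standalone sketch.
This is a slow reference version and not the optimized Scylla
implementation; the function name and the incremental-crc parameter
are my own choices:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bit-by-bit CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
// Pass the previous result as crc to continue a running checksum.
inline std::uint32_t crc32c(const void* data, std::size_t len, std::uint32_t crc = 0) noexcept {
    assert(data != nullptr || len == 0);
    const auto* p = static_cast<const std::uint8_t*>(data);
    crc = ~crc; // initial value 0xFFFFFFFF for a fresh checksum
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right and conditionally xor the polynomial, branch-free.
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return ~crc; // final xor
}
```

The standard check input "123456789" yields 0xE3069283 with this
polynomial, which makes it easy to validate a faster implementation
against this one.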

> +
> +public:
> + metadata_to_disk_buffer() = default;
> +
> + using to_disk_buffer::init; // Explicitly stated that stays the same
> +
> + virtual shared_ptr<metadata_to_disk_buffer> virtual_constructor() const {
> + return make_shared<metadata_to_disk_buffer>();
> + }

What's this?

> +
> + /**
> + * @brief Inits object, leaving it in state as if just after flushing with unflushed data end at
> + * @p cluster_beg_offset
> + *
> + * @param aligned_max_size size of the buffer, must be aligned
> + * @param alignment write alignment
> + * @param cluster_beg_offset disk offset of the beginning of the cluster
> + * @param metadata_end_pos position at which valid metadata ends: valid metadata range: [0, @p metadata_end_pos)
> + */
> + virtual void init_from_bootstrapped_cluster(size_t aligned_max_size, unit_size_t alignment,
> + disk_offset_t cluster_beg_offset, size_t metadata_end_pos) {

> + assert(is_power_of_2(alignment));
> + assert(mod_by_power_of_2(aligned_max_size, alignment) == 0);
> + assert(mod_by_power_of_2(cluster_beg_offset, alignment) == 0);

> + assert(aligned_max_size >= sizeof(ondisk_type) + sizeof(ondisk_checkpoint));
> + assert(alignment >= sizeof(ondisk_type) + sizeof(ondisk_checkpoint) + sizeof(ondisk_type) +
> + sizeof(ondisk_next_metadata_cluster) and
> + "We always need to be able to pack at least a checkpoint and next_metadata_cluster entry to the last "
> + "data flush in the cluster");
> + assert(metadata_end_pos < aligned_max_size);
> +
> + _max_size = aligned_max_size;
> + _alignment = alignment;
> + _cluster_beg_offset = cluster_beg_offset;
> + auto aligned_pos = round_up_to_multiple_of_power_of_2(metadata_end_pos, _alignment);
> + _unflushed_data = {aligned_pos, aligned_pos};

> + _buff = decltype(_buff)::aligned(_alignment, _max_size);

> +
> + start_new_unflushed_data();
> + }
> +

> +protected:

> + void start_new_unflushed_data() noexcept override {
> + if (bytes_left() < sizeof(ondisk_type) + sizeof(ondisk_checkpoint) + sizeof(ondisk_type) +
> + sizeof(ondisk_next_metadata_cluster)) {
> + assert(bytes_left() == 0); // alignment has to be big enough to hold checkpoint and next_metadata_cluster
> + return; // No more space
> + }
> +

> + ondisk_type type = INVALID;
> + ondisk_checkpoint checkpoint;
> + std::memset(&checkpoint, 0, sizeof(checkpoint));
> +
> + to_disk_buffer::append_bytes(&type, sizeof(type));
> + to_disk_buffer::append_bytes(&checkpoint, sizeof(checkpoint));
> +
> + _crc.reset();

> + }
> +
> + void prepare_unflushed_data_for_flush() noexcept override {

> + // Make checkpoint valid
> + constexpr ondisk_type checkpoint_type = CHECKPOINT;
> + size_t checkpoint_pos = _unflushed_data.beg + sizeof(checkpoint_type);
> + ondisk_checkpoint checkpoint;
> + checkpoint.checkpointed_data_length = _unflushed_data.end - checkpoint_pos - sizeof(checkpoint);
> + _crc.process_bytes(&checkpoint.checkpointed_data_length, sizeof(checkpoint.checkpointed_data_length));
> + checkpoint.crc32_code = _crc.checksum();
> +
> + std::memcpy(_buff.get_write() + _unflushed_data.beg, &checkpoint_type, sizeof(checkpoint_type));
> + std::memcpy(_buff.get_write() + checkpoint_pos, &checkpoint, sizeof(checkpoint));
> + }
> +
> +public:
> + using to_disk_buffer::bytes_left_after_flush_if_done_now; // Explicitly stated that stays the same
> +
> +private:
> + void append_bytes(const void* data, size_t len) noexcept override {
> + to_disk_buffer::append_bytes(data, len);
> + _crc.process_bytes(data, len);
> + }
> +
> +public:
> + enum append_result {
> + APPENDED,
> + TOO_BIG,
> + };
> +
> + [[nodiscard]] virtual append_result append(const ondisk_next_metadata_cluster& next_metadata_cluster) noexcept {
> + ondisk_type type = NEXT_METADATA_CLUSTER;
> + if (bytes_left() < ondisk_entry_size(next_metadata_cluster)) {

> + return TOO_BIG;
> + }
> +

> + append_bytes(&type, sizeof(type));
> + append_bytes(&next_metadata_cluster, sizeof(next_metadata_cluster));

> + return APPENDED;
> + }
> +

> + using to_disk_buffer::bytes_left;
> +
> +protected:
> + bool fits_for_append(size_t bytes_no) const noexcept {
> + // We need to reserve space for the next metadata cluster entry
> + return (bytes_left() >= bytes_no + sizeof(ondisk_type) + sizeof(ondisk_next_metadata_cluster));
> + }
> +
> +private:
> + template<class T>
> + [[nodiscard]] append_result append_simple(ondisk_type type, const T& entry) noexcept {
> + if (not fits_for_append(ondisk_entry_size(entry))) {

> + return TOO_BIG;
> + }
> +

> + append_bytes(&type, sizeof(type));
> + append_bytes(&entry, sizeof(entry));

> + return APPENDED;
> + }
> +

> +public:
> + using to_disk_buffer::flush_to_disk;

> +};
> +
> +} // namespace seastar::fs

> diff --git a/src/fs/to_disk_buffer.hh b/src/fs/to_disk_buffer.hh
> new file mode 100644
> index 00000000..612f26d2
> --- /dev/null
> +++ b/src/fs/to_disk_buffer.hh
> @@ -0,0 +1,138 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*
> + * Copyright (C) 2019 ScyllaDB

> + */
> +
> +#pragma once
> +

> +#include "fs/bitwise.hh"
> +#include "fs/units.hh"

> +#include "seastar/core/future.hh"
> +#include "seastar/core/temporary_buffer.hh"
> +#include "seastar/fs/block_device.hh"
> +

> +#include <cstring>

> +
> +namespace seastar::fs {
> +

> +// Represents buffer that will be written to a block_device. Method init() should be called just after constructor
> +// in order to finish construction.
> +class to_disk_buffer {

Please split this into a separate patch. It's big enough and it's hard
to understand the preceding code without it.

> +protected:
> + temporary_buffer<uint8_t> _buff;
> + size_t _max_size = 0;
> + unit_size_t _alignment = 0;
> + disk_offset_t _cluster_beg_offset = 0; // disk offset that corresponds to _buff.begin()
> + range<size_t> _unflushed_data = {0, 0}; // range of unflushed bytes in _buff
> +
> +public:
> + to_disk_buffer() = default;
> +
> + to_disk_buffer(const to_disk_buffer&) = delete;
> + to_disk_buffer& operator=(const to_disk_buffer&) = delete;
> + to_disk_buffer(to_disk_buffer&&) = default;
> + to_disk_buffer& operator=(to_disk_buffer&&) = default;
> +
> + // Total number of bytes appended cannot exceed @p aligned_max_size.
> + // @p cluster_beg_offset is the disk offset of the beginning of the cluster.
> + virtual void init(size_t aligned_max_size, unit_size_t alignment, disk_offset_t cluster_beg_offset) {

> + assert(is_power_of_2(alignment));
> + assert(mod_by_power_of_2(aligned_max_size, alignment) == 0);
> + assert(mod_by_power_of_2(cluster_beg_offset, alignment) == 0);
> +

> + _max_size = aligned_max_size;
> + _alignment = alignment;
> + _cluster_beg_offset = cluster_beg_offset;
> + _unflushed_data = {0, 0};

> + _buff = decltype(_buff)::aligned(_alignment, _max_size);
> + start_new_unflushed_data();
> + }
> +

Why is this virtual? I'm missing some theory of operation.

> + virtual ~to_disk_buffer() = default;
> +
> + /**
> + * @brief Writes buffered (unflushed) data to disk and starts a new unflushed data if there is enough space
> + * IMPORTANT: using this buffer before call to flush_to_disk() completes is perfectly OK
> + * @details After each flush we align the offset at which the new unflushed data is continued. This is very
> + * important, as it ensures that consecutive flushes, as their underlying write operations to a block device,
> + * do not overlap. If the writes overlapped, it would be possible that they would be written in the reverse order
> + * corrupting the on-disk data.
> + *
> + * @param device output device
> + */
> + virtual future<> flush_to_disk(block_device device) {
> + prepare_unflushed_data_for_flush();
> + // Data layout overview:
> + // |.........................|00000000000000000000000|
> + // ^ _unflushed_data.beg ^ _unflushed_data.end ^ real_write.end
> + // (aligned) (maybe unaligned) (aligned)
> + // == real_write.beg == new _unflushed_data.beg
> + // |<------ padding ------>|

> + assert(mod_by_power_of_2(_unflushed_data.beg, _alignment) == 0);
> + range real_write = {
> + _unflushed_data.beg,
> + round_up_to_multiple_of_power_of_2(_unflushed_data.end, _alignment),
> + };

> + // Pad buffer with zeros till alignment
> + range padding = {_unflushed_data.end, real_write.end};
> + std::memset(_buff.get_write() + padding.beg, 0, padding.size());

> +
> + // Make sure the buffer is usable before returning from this function
> + _unflushed_data = {real_write.end, real_write.end};
> + start_new_unflushed_data();
> +

> + return device.write(_cluster_beg_offset + real_write.beg, _buff.get_write() + real_write.beg, real_write.size())
> + .then([real_write](size_t written_bytes) {
> + if (written_bytes != real_write.size()) {
> + return make_exception_future<>(std::runtime_error("Partial write"));
> + // TODO: maybe add some way to retry write, because once the buffer is corrupt nothing can be done now

> + }
> +
> + return now();
> + });
> + }
> +

> +protected:
> + // May be called before the flushing previous fragment is
> + virtual void start_new_unflushed_data() noexcept {}
> +
> + virtual void prepare_unflushed_data_for_flush() noexcept {}
> +
> +public:
> + virtual void append_bytes(const void* data, size_t len) noexcept {
> + assert(len <= bytes_left());
> + std::memcpy(_buff.get_write() + _unflushed_data.end, data, len);

> + _unflushed_data.end += len;
> + }
> +

> + // Returns maximum number of bytes that may be written to buffer without calling reset()
> + virtual size_t bytes_left() const noexcept { return _max_size - _unflushed_data.end; }
> +

It looks wrong to have such a simple function virtual, but I admit I'm
lost here.

> + virtual size_t bytes_left_after_flush_if_done_now() const noexcept {
> + return _max_size - round_up_to_multiple_of_power_of_2(_unflushed_data.end, _alignment);
> + }
> +
> + // Returns disk offset of the place where the first byte of next appended bytes would be after flush
> + // TODO: maybe better name for that function? Or any other method to extract that data?
> + virtual disk_offset_t current_disk_offset() const noexcept {
> + return _cluster_beg_offset + _unflushed_data.end;

> + }
> +};
> +
> +} // namespace seastar::fs

> diff --git a/src/fs/unix_metadata.hh b/src/fs/unix_metadata.hh
> new file mode 100644
> index 00000000..6f634044
> --- /dev/null
> +++ b/src/fs/unix_metadata.hh
> @@ -0,0 +1,40 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*
> + * Copyright (C) 2019 ScyllaDB

> + */
> +
> +#pragma once
> +

> +#include "seastar/core/file-types.hh"
> +
> +#include <cstdint>
> +#include <sys/types.h>

> +
> +namespace seastar::fs {
> +

> +struct unix_metadata {

> + file_permissions perms;
> + uid_t uid;
> + gid_t gid;

If these go to disk, they should have an implementation-independent
definition, and the thing should be packed.
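
Something along these lines, as a sketch only. ondisk_unix_metadata,
unix_metadata_in_memory and to_ondisk() are hypothetical names, perms
is a plain integer standing in for file_permissions, the 32-bit field
widths are an assumption, and __attribute__((packed)) is a GCC/Clang
extension. Endianness is not addressed here:

```cpp
#include <cassert>
#include <cstdint>
#include <sys/types.h>

// Hypothetical on-disk counterpart of unix_metadata: fixed-width fields only,
// packed so the layout does not depend on the compiler's padding rules.
struct ondisk_unix_metadata {
    std::uint32_t perms;
    std::uint32_t uid;
    std::uint32_t gid;
    std::uint64_t btime_ns;
    std::uint64_t mtime_ns;
    std::uint64_t ctime_ns;
} __attribute__((packed));

static_assert(sizeof(ondisk_unix_metadata) == 36, "on-disk layout must not contain padding");

// Stand-in for the in-memory struct, which keeps the platform types.
struct unix_metadata_in_memory {
    std::uint32_t perms;
    uid_t uid;
    gid_t gid;
    std::uint64_t btime_ns;
    std::uint64_t mtime_ns;
    std::uint64_t ctime_ns;
};

inline ondisk_unix_metadata to_ondisk(const unix_metadata_in_memory& m) noexcept {
    // uid_t and gid_t widths are platform-defined; check that they fit.
    assert(static_cast<std::uint64_t>(m.uid) <= 0xFFFFFFFFull);
    assert(static_cast<std::uint64_t>(m.gid) <= 0xFFFFFFFFull);
    ondisk_unix_metadata d;
    d.perms = m.perms;
    d.uid = static_cast<std::uint32_t>(m.uid);
    d.gid = static_cast<std::uint32_t>(m.gid);
    d.btime_ns = m.btime_ns;
    d.mtime_ns = m.mtime_ns;
    d.ctime_ns = m.ctime_ns;
    return d;
}
```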

> + uint64_t btime_ns;
> + uint64_t mtime_ns;
> + uint64_t ctime_ns;
> +};
> +

This can also go into its own patch.

> +} // namespace seastar::fs

> diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc
> new file mode 100644
> index 00000000..6e29f2e5
> --- /dev/null
> +++ b/src/fs/metadata_log.cc
> @@ -0,0 +1,222 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*

> +#include "fs/cluster.hh"
> +#include "fs/cluster_allocator.hh"
> +#include "fs/inode.hh"
> +#include "fs/inode_info.hh"
> +#include "fs/metadata_disk_entries.hh"
> +#include "fs/metadata_log.hh"
> +#include "fs/metadata_log_bootstrap.hh"
> +#include "fs/metadata_to_disk_buffer.hh"
> +#include "fs/path.hh"
> +#include "fs/units.hh"
> +#include "fs/unix_metadata.hh"
> +#include "seastar/core/aligned_buffer.hh"
> +#include "seastar/core/do_with.hh"
> +#include "seastar/core/file-types.hh"
> +#include "seastar/core/future-util.hh"
> +#include "seastar/core/future.hh"
> +#include "seastar/core/shared_mutex.hh"
> +#include "seastar/fs/overloaded.hh"
> +
> +#include <boost/crc.hpp>
> +#include <boost/range/irange.hpp>
> +#include <chrono>
> +#include <cstddef>
> +#include <cstdint>
> +#include <cstring>
> +#include <limits>
> +#include <stdexcept>
> +#include <string_view>
> +#include <unordered_set>
> +#include <variant>

> +
> +namespace seastar::fs {
> +

> +metadata_log::metadata_log(block_device device, uint32_t cluster_size, uint32_t alignment,
> + shared_ptr<metadata_to_disk_buffer> cluster_buff)
> +: _device(std::move(device))
> +, _cluster_size(cluster_size)
> +, _alignment(alignment)
> +, _curr_cluster_buff(std::move(cluster_buff))
> +, _cluster_allocator({}, {})
> +, _inode_allocator(1, 0) {
> +

We generally double-indent initializer lists like this one.

> + assert(is_power_of_2(alignment));
> + assert(cluster_size > 0 and cluster_size % alignment == 0);
> +}
> +
> +metadata_log::metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment)
> +: metadata_log(std::move(device), cluster_size, alignment,
> + make_shared<metadata_to_disk_buffer>()) {}
> +
> +future<> metadata_log::bootstrap(inode_t root_dir, cluster_id_t first_metadata_cluster_id, cluster_range available_clusters,
> + fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id) {
> + return metadata_log_bootstrap::bootstrap(*this, root_dir, first_metadata_cluster_id, available_clusters,
> + fs_shards_pool_size, fs_shard_id);
> +}
> +
> +future<> metadata_log::shutdown() {
> + return flush_log().then([this] {
> + return _device.close();
> + });
> +}
> +
> +void metadata_log::schedule_flush_of_curr_cluster() {
> + // Make writes concurrent (TODO: maybe serialized within *one* cluster would be faster?)
> + schedule_background_task(do_with(_curr_cluster_buff, &_device, [](auto& crr_clstr_bf, auto& device) {
> + return crr_clstr_bf->flush_to_disk(*device);
> + }));
> +}
> +
> +future<> metadata_log::flush_curr_cluster() {
> + if (_curr_cluster_buff->bytes_left_after_flush_if_done_now() == 0) {
> + switch (schedule_flush_of_curr_cluster_and_change_it_to_new_one()) {
> + case flush_result::NO_SPACE:
> + return make_exception_future(no_more_space_exception());
> + case flush_result::DONE:
> + break;
> + }
> + } else {
> + schedule_flush_of_curr_cluster();
> + }
> +
> + return _background_futures.get_future();
> +}
> +
> +metadata_log::flush_result metadata_log::schedule_flush_of_curr_cluster_and_change_it_to_new_one() {
> + auto next_cluster = _cluster_allocator.alloc();
> + if (not next_cluster) {
> + // Here metadata log dies, we cannot even flush current cluster because from there we won't be able to recover
> + // TODO: ^ add protection from it and take it into account during compaction

Yes, we need two thresholds for allocation: data cluster allocations
should fail early, with a reserve that is enough for compaction for
metadata clusters. Perhaps we need three thresholds, one for data
clusters, one for new metadata clusters (allowing file deletions only),
and one for metadata log compaction (allowing allocation of the very last
cluster on disk).
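
A sketch of the three-threshold idea. The names and the reserve sizes
are hypothetical; the real policy would live in or next to the cluster
allocator:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical allocation classes, ordered from least to most privileged.
enum class alloc_purpose { data, metadata, compaction };

// Hypothetical policy: a purpose may allocate only while more free clusters
// remain than the reserve held back for the more privileged purposes.
struct cluster_reserve_policy {
    std::size_t metadata_reserve;   // kept for new metadata clusters and compaction
    std::size_t compaction_reserve; // kept for metadata log compaction only

    bool may_allocate(alloc_purpose purpose, std::size_t free_clusters) const noexcept {
        assert(compaction_reserve <= metadata_reserve);
        switch (purpose) {
        case alloc_purpose::data:
            return free_clusters > metadata_reserve;
        case alloc_purpose::metadata:
            return free_clusters > compaction_reserve;
        case alloc_purpose::compaction:
            return free_clusters > 0; // may take the very last cluster
        }
        return false;
    }
};
```

With reserves of 4 and 1, data writes start failing at 4 free clusters,
new metadata clusters at 1, and compaction can still use the last one.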

> + return flush_result::NO_SPACE;
> + }
> +
> + auto append_res = _curr_cluster_buff->append(ondisk_next_metadata_cluster {*next_cluster});
> + assert(append_res == metadata_to_disk_buffer::APPENDED);
> + schedule_flush_of_curr_cluster();
> +
> + // Make next cluster the current cluster to allow writing next metadata entries before flushing finishes
> + _curr_cluster_buff->virtual_constructor();
> + _curr_cluster_buff->init(_cluster_size, _alignment,
> + cluster_id_to_offset(*next_cluster, _cluster_size));
> + return flush_result::DONE;
> +}
> +
> +std::variant<inode_t, metadata_log::path_lookup_error> metadata_log::do_path_lookup(const std::string& path) const noexcept {

Maybe the in-memory representation should be split from the metadata
log, they are both liable to grow.

> + if (path.empty() or path[0] != '/') {
> + return path_lookup_error::NOT_ABSOLUTE;
> + }
> +
> + std::vector<inode_t> components_stack = {_root_dir};
> + size_t beg = 0;
> + while (beg < path.size()) {
> + range component_range = {beg, path.find('/', beg)};
> + bool check_if_dir = false;
> + if (component_range.end == path.npos) {
> + component_range.end = path.size();
> + beg = path.size();
> + } else {
> + check_if_dir = true;
> + beg = component_range.end + 1; // Jump over '/'

Does this deal with "a////b"?
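
As far as I can tell from the rest of the loop, repeated slashes
produce empty components, which the component == "" branch skips. A
standalone sketch that mirrors the splitting (split_components is a
hypothetical helper and does no inode lookups) would make a handy unit
test:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Splits an absolute path the way the quoted loop walks it: empty components
// (from repeated '/') and "." are skipped, ".." pops but never above the root.
inline std::vector<std::string> split_components(std::string_view path) {
    assert(!path.empty() && path[0] == '/');
    std::vector<std::string> components;
    std::size_t beg = 0;
    while (beg < path.size()) {
        std::size_t end = path.find('/', beg);
        if (end == std::string_view::npos) {
            end = path.size();
        }
        std::string_view component = path.substr(beg, end - beg);
        beg = end + 1; // jump over '/' (or past the end after the last component)
        if (component.empty() || component == ".") {
            continue;
        }
        if (component == "..") {
            if (!components.empty()) {
                components.pop_back();
            }
            continue;
        }
        components.emplace_back(component);
    }
    return components;
}
```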

> + }
> +
> + std::string_view component(path.data() + component_range.beg, component_range.size());
> + // Process the component
> + if (component == "") {
> + continue;
> + } else if (component == ".") {
> + assert(component_range.beg > 0 and path[component_range.beg - 1] == '/' and "Since path is absolute we do not have to check if the current component is a directory");
> + continue;
> + } else if (component == "..") {
> + if (components_stack.size() > 1) { // Root dir cannot be popped
> + components_stack.pop_back();
> + }
> + } else {
> + auto dir_it = _inodes.find(components_stack.back());
> + assert(dir_it != _inodes.end() and "inode comes from some previous lookup (or is a root directory) hence dir_it has to be valid");
> + assert(dir_it->second.is_directory() and "every previous component is a directory and it was checked when they were processed");
> + auto& curr_dir = dir_it->second.get_directory();
> +
> + auto it = curr_dir.entries.find(component);
> + if (it == curr_dir.entries.end()) {
> + return path_lookup_error::NO_ENTRY;
> + }
> +
> + inode_t entry_inode = it->second;
> + if (check_if_dir) {
> + auto entry_it = _inodes.find(entry_inode);
> + assert(entry_it != _inodes.end() and "dir entries have to exist");
> + if (not entry_it->second.is_directory()) {
> + return path_lookup_error::NOT_DIR;
> + }
> + }
> +
> + components_stack.emplace_back(entry_inode);
> + }
> + }
> +
> + return components_stack.back();
> +}
> +
> +future<inode_t> metadata_log::path_lookup(const std::string& path) const {
> + auto lookup_res = do_path_lookup(path);
> + return std::visit(overloaded {
> + [](path_lookup_error error) {

> + switch (error) {
> + case path_lookup_error::NOT_ABSOLUTE:

> + return make_exception_future<inode_t>(path_is_not_absolute_exception());
> + case path_lookup_error::NO_ENTRY:
> + return make_exception_future<inode_t>(no_such_file_or_directory_exception());
> + case path_lookup_error::NOT_DIR:
> + return make_exception_future<inode_t>(path_component_not_directory_exception());

> + }
> + __builtin_unreachable();
> + },

> + [](inode_t inode) {
> + return make_ready_future<inode_t>(inode);
> + }
> + }, lookup_res);
> +}
> +
> +file_offset_t metadata_log::file_size(inode_t inode) const {

> + auto it = _inodes.find(inode);
> + if (it == _inodes.end()) {
> + throw invalid_inode_exception();

> + }
> +
> + return std::visit(overloaded {
> + [](const inode_info::file& file) {
> + return file.size();
> + },
> + [](const inode_info::directory&) -> file_offset_t {
> + throw invalid_inode_exception();

You can return the number of entries in a directory. ls -l doesn't throw
when one of the entries is a directory.
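
For example, with simplified stand-ins for inode_info::file and
inode_info::directory (the struct names and the local overloaded
helper below are hypothetical, to keep the sketch standalone):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <variant>

using file_offset_t = std::uint64_t;

// Simplified stand-ins for the patch's inode contents.
struct file_info { file_offset_t size; };
struct directory_info { std::map<std::string, std::uint64_t> entries; };
using inode_contents = std::variant<file_info, directory_info>;

// Local copy of the usual overloaded-visitor helper.
template <class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
template <class... Ts> overloaded(Ts...) -> overloaded<Ts...>;

// Files report their byte size; directories report their entry count instead of throwing.
inline file_offset_t size_of(const inode_contents& contents) {
    return std::visit(overloaded {
        [](const file_info& f) -> file_offset_t { return f.size; },
        [](const directory_info& d) -> file_offset_t { return d.entries.size(); },
    }, contents);
}
```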

> + }
> + }, it->second.contents);
> +}
> +
> +// TODO: think about how to make filesystem recoverable from ENOSPACE situation: flush() (or something else) throws ENOSPACE,
> +// then it should be possible to compact some data (e.g. by truncating a file) via top-level interface and retrying the flush()
> +// without a ENOSPACE error. In particular if we delete all files after ENOSPACE it should be successful. It becomes especially
> +// hard if we write metadata to the last cluster and there is no enough room to write these delete operations. We have to
> +// guarantee that the filesystem is in a recoverable state then.

Yes, I posted some ideas earlier.

> +
> +} // namespace seastar::fs

> diff --git a/src/fs/metadata_log_bootstrap.cc b/src/fs/metadata_log_bootstrap.cc
> new file mode 100644
> index 00000000..926d79fe
> --- /dev/null
> +++ b/src/fs/metadata_log_bootstrap.cc
> @@ -0,0 +1,264 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*
> + * Copyright (C) 2020 ScyllaDB
> + */
> +
> +#include "fs/bitwise.hh"
> +#include "fs/metadata_disk_entries.hh"
> +#include "fs/metadata_log_bootstrap.hh"
> +#include "seastar/util/log.hh"

> +
> +namespace seastar::fs {
> +

> +namespace {
> +logger mlogger("fs_metadata_bootstrap");
> +} // namespace
> +
> +bool data_reader::read(void* destination, size_t size) {
> + if (_pos + size > _size) {
> + return false;
> + }
> +
> + std::memcpy(destination, _data + _pos, size);
> + _pos += size;
> + return true;
> +}
> +
> +bool data_reader::read_string(std::string& str, size_t size) {
> + str.resize(size);
> + return read(str.data(), size);
> +}
> +
> +std::optional<temporary_buffer<uint8_t>> data_reader::read_tmp_buff(size_t size) {
> + if (_pos + size > _size) {

> + return std::nullopt;
> + }
> +

> + _pos += size;
> + return temporary_buffer<uint8_t>(_data + _pos - size, size);
> +}
> +
> +bool data_reader::process_crc_without_reading(boost::crc_32_type& crc, size_t size) {
> + if (_pos + size > _size) {
> + return false;
> + }
> +
> + crc.process_bytes(_data + _pos, size);
> + return true;
> +}
> +
> +std::optional<data_reader> data_reader::extract(size_t size) {
> + if (_pos + size > _size) {

> + return std::nullopt;
> + }
> +

> + _pos += size;
> + return data_reader(_data + _pos - size, size);
> +}
> +
> +metadata_log_bootstrap::metadata_log_bootstrap(metadata_log& metadata_log, cluster_range available_clusters)
> +: _metadata_log(metadata_log)
> +, _available_clusters(available_clusters)
> +, _curr_cluster_data(decltype(_curr_cluster_data)::aligned(metadata_log._alignment, metadata_log._cluster_size))
> +{}
> +
> +future<> metadata_log_bootstrap::bootstrap(cluster_id_t first_metadata_cluster_id, fs_shard_id_t fs_shards_pool_size,
> + fs_shard_id_t fs_shard_id) {
> + _next_cluster = first_metadata_cluster_id;
> + mlogger.debug(">>>> Started bootstrapping <<<<");
> + return do_with((cluster_id_t)first_metadata_cluster_id, [this](cluster_id_t& last_cluster) {
> + return do_until([this] { return not _next_cluster.has_value(); }, [this, &last_cluster] {
> + cluster_id_t curr_cluster = *_next_cluster;
> + _next_cluster = std::nullopt;
> + bool inserted = _taken_clusters.emplace(curr_cluster).second;
> + assert(inserted); // TODO: check it in next_cluster record
> + last_cluster = curr_cluster;
> + return bootstrap_cluster(curr_cluster);
> + }).then([this, &last_cluster] {
> + mlogger.debug("Data bootstrapping is done");
> + // Initialize _curr_cluster_buff
> + _metadata_log._curr_cluster_buff = _metadata_log._curr_cluster_buff->virtual_constructor();
> + mlogger.debug("Initializing _curr_cluster_buff: cluster {}, pos {}", last_cluster, _curr_cluster.last_checkpointed_pos());
> + _metadata_log._curr_cluster_buff->init_from_bootstrapped_cluster(_metadata_log._cluster_size,
> + _metadata_log._alignment, cluster_id_to_offset(last_cluster, _metadata_log._cluster_size),
> + _curr_cluster.last_checkpointed_pos());
> + });
> + }).then([this, fs_shards_pool_size, fs_shard_id] {
> + // Initialize _cluster_allocator
> + mlogger.debug("Initializing cluster allocator");
> + std::deque<cluster_id_t> free_clusters;
> + for (auto cid : boost::irange(_available_clusters.beg, _available_clusters.end)) {
> + if (_taken_clusters.count(cid) == 0) {
> + free_clusters.emplace_back(cid);
> + }
> + }
> + if (free_clusters.empty()) {
> + return make_exception_future(no_more_space_exception());
> + }
> + free_clusters.pop_front();
> +
> + mlogger.debug("free clusters: {}", free_clusters.size());
> + _metadata_log._cluster_allocator = cluster_allocator(std::move(_taken_clusters), std::move(free_clusters));
> +
> + // Reset _inode_allocator
> + std::optional<inode_t> max_inode_no;
> + if (not _metadata_log._inodes.empty()) {
> + max_inode_no =_metadata_log._inodes.rbegin()->first;
> + }
> + _metadata_log._inode_allocator = shard_inode_allocator(fs_shards_pool_size, fs_shard_id, max_inode_no);
> +
> + // TODO: what about orphaned inodes: maybe they are remnants of unlinked files and we need to delete them,
> + // or maybe not?

> + return now();
> + });

You can use a coroutine for this; I don't mind making it C++20-only. It
will be approximately 83,223.772 times more readable.

> +}
> +
> +future<> metadata_log_bootstrap::bootstrap_cluster(cluster_id_t curr_cluster) {
> + disk_offset_t curr_cluster_disk_offset = cluster_id_to_offset(curr_cluster, _metadata_log._cluster_size);
> + mlogger.debug("Bootstrapping from cluster {}...", curr_cluster);
> + return _metadata_log._device.read(curr_cluster_disk_offset, _curr_cluster_data.get_write(),
> + _metadata_log._cluster_size).then([this, curr_cluster](size_t bytes_read) {
> + if (bytes_read != _metadata_log._cluster_size) {
> + return make_exception_future(std::runtime_error("Failed to read whole cluster of the metadata log"));
> + }
> +
> + mlogger.debug("Read cluster {}", curr_cluster);
> + _curr_cluster = data_reader(_curr_cluster_data.get(), _metadata_log._cluster_size);
> + return bootstrap_read_cluster();
> + });
> +}
> +
> +future<> metadata_log_bootstrap::bootstrap_read_cluster() {
> + // Process cluster: the data layout format is:
> + // | checkpoint1 | data1... | checkpoint2 | data2... | ... |
> + return do_with(false, [this](bool& whole_log_ended) {
> + return do_until([this, &whole_log_ended] { return whole_log_ended or _next_cluster.has_value(); },
> + [this, &whole_log_ended] {
> + _curr_cluster.align_curr_pos(_metadata_log._alignment);
> + _curr_cluster.checkpoint_curr_pos();
> +
> + if (not read_and_check_checkpoint()) {
> + mlogger.debug("Checkpoint invalid");
> + whole_log_ended = true;

> + return now();
> + }
> +

> + mlogger.debug("Checkpoint correct");
> + return bootstrap_checkpointed_data();
> + }).then([] {
> + mlogger.debug("Cluster ended");
> + });
> + });
> +}
> +
> +bool metadata_log_bootstrap::read_and_check_checkpoint() {
> + mlogger.debug("Processing checkpoint at {}", _curr_cluster.curr_pos());
> + ondisk_type entry_type;
> + ondisk_checkpoint checkpoint;
> + if (not _curr_cluster.read_entry(entry_type)) {
> + mlogger.debug("Cannot read entry type");
> + return false;
> + }
> + if (entry_type != CHECKPOINT) {
> + mlogger.debug("Entry type (= {}) is not CHECKPOINT (= {})", entry_type, CHECKPOINT);
> + return false;
> + }
> + if (not _curr_cluster.read_entry(checkpoint)) {
> + mlogger.debug("Cannot read checkpoint entry");
> + return false;
> + }
> +
> + boost::crc_32_type crc;
> + if (not _curr_cluster.process_crc_without_reading(crc, checkpoint.checkpointed_data_length)) {
> + mlogger.debug("Invalid checkpoint's data length: {}", (unit_size_t)checkpoint.checkpointed_data_length);
> + return false;
> + }
> + crc.process_bytes(&checkpoint.checkpointed_data_length, sizeof(checkpoint.checkpointed_data_length));
> + if (crc.checksum() != checkpoint.crc32_code) {
> + mlogger.debug("CRC code does not match: computed = {}, read = {}", crc.checksum(), (uint32_t)checkpoint.crc32_code);
> + return false;
> + }
> +
> + auto opt = _curr_cluster.extract(checkpoint.checkpointed_data_length);
> + assert(opt.has_value());
> + _curr_checkpoint = *opt;
> + return true;
> +}
> +
> +future<> metadata_log_bootstrap::bootstrap_checkpointed_data() {
> + return do_with(ondisk_type {}, [this](ondisk_type& entry_type) {
> + return do_until([this, &entry_type] { return not _curr_checkpoint.read_entry(entry_type); },
> + [this, &entry_type] {
> + switch (entry_type) {
> + case INVALID:
> + case CHECKPOINT: // CHECKPOINT cannot appear as part of checkpointed data
> + return invalid_entry_exception();
> + case NEXT_METADATA_CLUSTER:
> + return bootstrap_next_metadata_cluster();
> + }
> +
> + // unknown type => metadata log corruption
> + return invalid_entry_exception();
> + }).then([this] {
> + if (_curr_checkpoint.bytes_left() > 0) {
> + return invalid_entry_exception(); // Corrupted checkpointed data

> + }
> + return now();
> + });

> + });
> +}
> +
> +future<> metadata_log_bootstrap::bootstrap_next_metadata_cluster() {
> + ondisk_next_metadata_cluster entry;
> + if (not _curr_checkpoint.read_entry(entry)) {
> + return invalid_entry_exception();
> + }
> +
> + if (_next_cluster.has_value()) {
> + return invalid_entry_exception(); // Only one NEXT_METADATA_CLUSTER may appear in one cluster
> + }
> +
> + _next_cluster = (cluster_id_t)entry.cluster_id;

> + return now();
> +}
> +

> +bool metadata_log_bootstrap::inode_exists(inode_t inode) {
> + return _metadata_log._inodes.count(inode) != 0;
> +}
> +
> +future<> metadata_log_bootstrap::bootstrap(metadata_log& metadata_log, inode_t root_dir, cluster_id_t first_metadata_cluster_id,
> + cluster_range available_clusters, fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id) {
> + // Clear the metadata log
> + metadata_log._inodes.clear();
> + metadata_log._background_futures = now();
> + metadata_log._root_dir = root_dir;
> + metadata_log._inodes.emplace(root_dir, inode_info {
> + 0,
> + 0,
> + {}, // TODO: change it to something meaningful
> + inode_info::directory {}
> + });
> +
> + return do_with(metadata_log_bootstrap(metadata_log, available_clusters),
> + [first_metadata_cluster_id, fs_shards_pool_size, fs_shard_id](metadata_log_bootstrap& bootstrap) {
> + return bootstrap.bootstrap(first_metadata_cluster_id, fs_shards_pool_size, fs_shard_id);
> + });
> +}
> +
> +} // namespace seastar::fs
> diff --git a/CMakeLists.txt b/CMakeLists.txt
> index 8a59eca6..19666a8a 100644
> --- a/CMakeLists.txt
> +++ b/CMakeLists.txt
> @@ -658,6 +658,7 @@ if (Seastar_EXPERIMENTAL_FS)

> PRIVATE
> # SeastarFS source files

> include/seastar/fs/block_device.hh
> + include/seastar/fs/exceptions.hh
> include/seastar/fs/file.hh
> include/seastar/fs/overloaded.hh
> include/seastar/fs/temporary_file.hh
> @@ -670,9 +671,18 @@ if (Seastar_EXPERIMENTAL_FS)
> src/fs/crc.hh
> src/fs/file.cc
> src/fs/inode.hh
> + src/fs/inode_info.hh
> + src/fs/metadata_disk_entries.hh
> + src/fs/metadata_log.cc
> + src/fs/metadata_log.hh
> + src/fs/metadata_log_bootstrap.cc
> + src/fs/metadata_log_bootstrap.hh
> + src/fs/metadata_to_disk_buffer.hh
> src/fs/path.hh
> src/fs/range.hh
> + src/fs/to_disk_buffer.hh
> src/fs/units.hh
> + src/fs/unix_metadata.hh
> src/fs/value_shared_lock.hh
> )
> endif()

Avi Kivity

<avi@scylladb.com>

Apr 22, 2020, 8:13:09 AM

to Piotr Sarna, Krzysztof Małysa, seastar-dev, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

This is certainly true, but could be implemented just as well on top of seastar::file.

Secondly, having a separate wrapper abstraction would express explicitly that it's not just a regular file, but a raw block device, so we could for example encapsulate all interesting metrics in one place - block device is somewhat special, so wrapping it in a more specific wrapper narrows its use-cases and provides more context. It's similar to how Scylla's materialized views have a view_ptr wrapper, which is pretty much only a schema_ptr, of which we *know* that it is a materialized view. block_device() is also quite a thin wrapper now, but it can potentially be more, should we want to provide some block-device-specific assertions or optimizations. These are not super strong points and one of them depends on a potential support from Seastar reactor which does not exist yet, so I can be convinced to drop the block_device abstraction altogether, but nonetheless, these were my original reasons.

This is a legitimate reason. Indeed, Linux has both block devices (also called block_device) and files, internally as well as externally, and you can convert from one to the other at will (opening a block device returns a file, and the loop driver converts a file into a block device). It used to be a good way to trigger stack overflows.

I don't mind block_device by itself, but note it has to live outside fs/ if it wants to be a seastar-level abstraction.

Avi Kivity

<avi@scylladb.com>

Apr 22, 2020, 8:15:04 AM

to Benny Halevy, Krzysztof Małysa, seastar-dev@googlegroups.com, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

On 4/22/20 2:00 PM, Benny Halevy wrote:
> Hi Krzysztof,
>
> Thanks for doing this!
> One question regarding the high level interface to the filesystem.
> A direction I had in mind for quite a while but never got to do it
> is to define a virtual filesystem abstraction in seastar that covers both
> metadata operations and I/O operations, so that we can present and easily use
> multiple filesystem implementations like posix, a ram-based file system
> for testing (with error injection), and now your filesystem.

Yes, this is more or less required, similar to Linux vfs. But it can
easily be delayed until later.

Avi Kivity

<avi@scylladb.com>

Apr 22, 2020, 8:37:45 AM

to Piotr Sarna, Krzysztof Małysa, seastar-dev, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

Here's another reason: a per-request force-unit-access (FUA) flag.

Files can choose whether all their requests are FUA, or none of them, by enabling O_DSYNC. A filesystem that multiplexes O_DSYNC and !O_DSYNC files on a block device will generate a stream of requests, some FUA and some not, and the file interface has no way to specify this property. And since Scylla already mixes O_DSYNC and regular files, we already have this distinction (although it collapses on enterprise disks where every request is FUA anyway).
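
To make the point concrete, here is a rough sketch of what a per-request flag could look like at the block-device level. All names here (write_request, open_file, make_write) are hypothetical and do not exist in Seastar; the only idea taken from the discussion above is that the FUA property travels with each request rather than with the file handle:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical block-device write request: FUA is a property of the request.
struct write_request {
    uint64_t offset;
    std::size_t length;
    bool fua; // force unit access: must reach stable media before completion
};

// Hypothetical per-file state inside the filesystem.
struct open_file {
    bool dsync; // file was opened with O_DSYNC semantics
};

// The filesystem multiplexes many files onto one device, so it tags each
// request individually, based on the file the write originated from.
inline write_request make_write(const open_file& f, uint64_t offset, std::size_t length) {
    return write_request{offset, length, f.dsync};
}
```

With a plain file interface the flag would have to be fixed once at open time, which is exactly the limitation described above.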

Avi Kivity

<avi@scylladb.com>

Apr 22, 2020, 9:27:04 AM

to Krzysztof Małysa, seastar-dev@googlegroups.com, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

On 4/20/20 3:01 PM, Krzysztof Małysa wrote:

> Creating unlinked file may be useful as temporary file or to expose the
> file via path only after the file is filled with contents.

>
> Signed-off-by: Krzysztof Małysa <var...@gmail.com>
> ---

> src/fs/metadata_disk_entries.hh | 51 +++++++++++-
> src/fs/metadata_log.hh | 6 ++
> src/fs/metadata_log_bootstrap.hh | 2 +
> .../create_and_open_unlinked_file.hh | 77 +++++++++++++++++++
> src/fs/metadata_to_disk_buffer.hh | 5 ++
> src/fs/metadata_log.cc | 21 +++++
> src/fs/metadata_log_bootstrap.cc | 13 ++++
> CMakeLists.txt | 1 +
> 8 files changed, 175 insertions(+), 1 deletion(-)
> create mode 100644 src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
>
> diff --git a/src/fs/metadata_disk_entries.hh b/src/fs/metadata_disk_entries.hh
> index 44c2a1c7..437c2c2b 100644
> --- a/src/fs/metadata_disk_entries.hh
> +++ b/src/fs/metadata_disk_entries.hh
> @@ -27,10 +27,52 @@
>
> namespace seastar::fs {
>
> +struct ondisk_unix_metadata {
> + uint32_t perms;
> + uint32_t uid;
> + uint32_t gid;

> + uint64_t btime_ns;
> + uint64_t mtime_ns;
> + uint64_t ctime_ns;

> +} __attribute__((packed));
> +
> +static_assert(sizeof(decltype(ondisk_unix_metadata::perms)) >= sizeof(decltype(unix_metadata::perms)));
> +static_assert(sizeof(decltype(ondisk_unix_metadata::uid)) >= sizeof(decltype(unix_metadata::uid)));
> +static_assert(sizeof(decltype(ondisk_unix_metadata::gid)) >= sizeof(decltype(unix_metadata::gid)));
> +static_assert(sizeof(decltype(ondisk_unix_metadata::btime_ns)) >= sizeof(decltype(unix_metadata::btime_ns)));
> +static_assert(sizeof(decltype(ondisk_unix_metadata::mtime_ns)) >= sizeof(decltype(unix_metadata::mtime_ns)));
> +static_assert(sizeof(decltype(ondisk_unix_metadata::ctime_ns)) >= sizeof(decltype(unix_metadata::ctime_ns)));
> +
> +inline unix_metadata ondisk_metadata_to_metadata(const ondisk_unix_metadata& ondisk_metadata) noexcept {
> + unix_metadata res;
> + static_assert(sizeof(ondisk_metadata) == 36,
> + "metadata size changed: check if above static asserts and below assignments need update");
> + res.perms = static_cast<file_permissions>(ondisk_metadata.perms);
> + res.uid = ondisk_metadata.uid;
> + res.gid = ondisk_metadata.gid;
> + res.btime_ns = ondisk_metadata.btime_ns;
> + res.mtime_ns = ondisk_metadata.mtime_ns;
> + res.ctime_ns = ondisk_metadata.ctime_ns;

Need to deal with endianness here. See adjust_endianness() in net/ for
one way to do this, but since you have functions for conversion you can
do it in these functions too. I suggest using little endian as the base
format, since that's where it will be used.
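
One possible shape for this, sketched with standard C++ only (store_le and load_le are hypothetical helper names, not Seastar functions). The helpers serialize byte by byte, so the on-disk layout is little endian regardless of the host byte order:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Writes `value` to `out` as sizeof(T) little-endian bytes.
template <typename T>
inline void store_le(uint8_t* out, T value) {
    static_assert(std::is_unsigned_v<T>, "use unsigned types for on-disk fields");
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        out[i] = static_cast<uint8_t>(value >> (8 * i));
    }
}

// Reads sizeof(T) little-endian bytes from `in`.
template <typename T>
inline T load_le(const uint8_t* in) {
    static_assert(std::is_unsigned_v<T>, "use unsigned types for on-disk fields");
    T value = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        value = static_cast<T>(value | (static_cast<T>(in[i]) << (8 * i)));
    }
    return value;
}
```

The conversion functions in the patch could call such helpers per field (perms, uid, gid and the three timestamps). On little-endian hosts compilers typically reduce these loops to plain loads and stores.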

> + return res;
> +}
> +

> +inline ondisk_unix_metadata metadata_to_ondisk_metadata(const unix_metadata& metadata) noexcept {
> + ondisk_unix_metadata res;
> + static_assert(sizeof(res) == 36, "metadata size changed: check if below assignments need update");
> + res.perms = static_cast<decltype(res.perms)>(metadata.perms);
> + res.uid = metadata.uid;
> + res.gid = metadata.gid;
> + res.btime_ns = metadata.btime_ns;
> + res.mtime_ns = metadata.mtime_ns;
> + res.ctime_ns = metadata.ctime_ns;

> + return res;
> +}
> +

> enum ondisk_type : uint8_t {
> INVALID = 0,
> CHECKPOINT,
> NEXT_METADATA_CLUSTER,
> + CREATE_INODE,
> };
>
> struct ondisk_checkpoint {
> @@ -54,9 +96,16 @@ struct ondisk_next_metadata_cluster {

> cluster_id_t cluster_id; // metadata log continues there

> } __attribute__((packed));
>
> +struct ondisk_create_inode {
> + inode_t inode;
> + uint8_t is_directory;
> + ondisk_unix_metadata metadata;
> +} __attribute__((packed));
> +
> template<typename T>

> constexpr size_t ondisk_entry_size(const T& entry) noexcept {

> - static_assert(std::is_same_v<T, ondisk_next_metadata_cluster>, "ondisk entry size not defined for given type");
> + static_assert(std::is_same_v<T, ondisk_next_metadata_cluster> or
> + std::is_same_v<T, ondisk_create_inode>, "ondisk entry size not defined for given type");
> return sizeof(ondisk_type) + sizeof(entry);
> }
>
> diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
> index c10852a3..6f069c13 100644
> --- a/src/fs/metadata_log.hh
> +++ b/src/fs/metadata_log.hh
> @@ -156,6 +156,8 @@ class metadata_log {
>
> friend class metadata_log_bootstrap;
>
> + friend class create_and_open_unlinked_file_operation;
> +
> public:

> metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment,

> shared_ptr<metadata_to_disk_buffer> cluster_buff);
> @@ -176,6 +178,8 @@ class metadata_log {
> return _inodes.count(inode) != 0;
> }
>
> + inode_info& memory_only_create_inode(inode_t inode, bool is_directory, unix_metadata metadata);
> +

How about s/bool/file_kind/, to support symbolic links and other animals.
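
Something along these lines, as a sketch only. The enumerator names and numeric values are assumptions, and file_kind_from_ondisk is a hypothetical helper for validating the byte read back from disk:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical replacement for the `bool is_directory` parameter.
enum class file_kind : uint8_t {
    regular = 0,
    directory = 1,
    symlink = 2,
};

// Decodes the on-disk byte; unknown values are reported as corruption
// (std::nullopt) instead of being silently treated as "not a directory".
inline std::optional<file_kind> file_kind_from_ondisk(uint8_t raw) {
    switch (raw) {
    case 0: return file_kind::regular;
    case 1: return file_kind::directory;
    case 2: return file_kind::symlink;
    }
    return std::nullopt;
}
```

Keeping regular = 0 and directory = 1 would stay compatible with the current uint8_t is_directory field.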

> template<class Func>
> void schedule_background_task(Func&& task) {

> _background_futures = when_all_succeed(_background_futures.get_future(), std::forward<Func>(task));

> @@ -286,6 +290,8 @@ class metadata_log {

> // Returns size of the file or throws exception iff @p inode is invalid

> file_offset_t file_size(inode_t inode) const;
>
> + future<inode_t> create_and_open_unlinked_file(file_permissions perms);

> +
> // All disk-related errors will be exposed here

> future<> flush_log() {
> return flush_curr_cluster();
> diff --git a/src/fs/metadata_log_bootstrap.hh b/src/fs/metadata_log_bootstrap.hh
> index 5da79631..4a1fa7e9 100644
> --- a/src/fs/metadata_log_bootstrap.hh
> +++ b/src/fs/metadata_log_bootstrap.hh
> @@ -115,6 +115,8 @@ class metadata_log_bootstrap {
>
> bool inode_exists(inode_t inode);
>
> + future<> bootstrap_create_inode();
> +
> public:

> static future<> bootstrap(metadata_log& metadata_log, inode_t root_dir, cluster_id_t first_metadata_cluster_id,

> cluster_range available_clusters, fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id);

> diff --git a/src/fs/metadata_log_operations/create_and_open_unlinked_file.hh b/src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
> new file mode 100644
> index 00000000..79c5e9f2
> --- /dev/null
> +++ b/src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
> @@ -0,0 +1,77 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*
> + * Copyright (C) 2020 ScyllaDB
> + */
> +
> +#pragma once
> +

> +#include "fs/metadata_disk_entries.hh"
> +#include "fs/metadata_log.hh"
> +#include "seastar/core/future.hh"

> +
> +namespace seastar::fs {
> +

> +class create_and_open_unlinked_file_operation {
> + metadata_log& _metadata_log;
> +
> + create_and_open_unlinked_file_operation(metadata_log& metadata_log) : _metadata_log(metadata_log) {}
> +
> + future<inode_t> create_and_open_unlinked_file(file_permissions perms) {
> + using namespace std::chrono;
> + uint64_t curr_time_ns = duration_cast<nanoseconds>(system_clock::now().time_since_epoch()).count();
> + unix_metadata unx_mtdt = {
> + perms,
> + 0, // TODO: Eventually, we'll want a user to be able to pass his credentials when bootstrapping the
> + 0, // file system -- that will allow us to authorize users on startup (e.g. via LDAP or whatnot).
> + curr_time_ns,
> + curr_time_ns,
> + curr_time_ns
> + };
> +
> + inode_t new_inode = _metadata_log._inode_allocator.alloc();
> + ondisk_create_inode ondisk_entry {
> + new_inode,
> + false,
> + metadata_to_ondisk_metadata(unx_mtdt)
> + };
> +
> + switch (_metadata_log.append_ondisk_entry(ondisk_entry)) {
> + case metadata_log::append_result::TOO_BIG:
> + return make_exception_future<inode_t>(cluster_size_too_small_to_perform_operation_exception());
> + case metadata_log::append_result::NO_SPACE:
> + return make_exception_future<inode_t>(no_more_space_exception());
> + case metadata_log::append_result::APPENDED:
> + inode_info& new_inode_info = _metadata_log.memory_only_create_inode(new_inode, false, unx_mtdt);

What if this fails?

Our options are:

- first create the new inode in memory (but locked), then either
unlock it or back it out

- undo the append somehow
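
The first option could look roughly like the sketch below. It uses simplified, hypothetical types (inode_entry, fake_log, create_inode) rather than the real metadata_log. The in-memory insertion, which is the step that can fail on allocation, happens before anything is appended to the log, and is backed out if the append fails:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>

using inode_t = uint64_t;

// Hypothetical, simplified in-memory inode record.
struct inode_entry {
    bool locked = false;
    uint32_t opened_files_count = 0;
};

enum class append_result { APPENDED, NO_SPACE };

// Hypothetical log with a fixed byte budget, standing in for the cluster buffer.
struct fake_log {
    std::size_t capacity_left;
    append_result append(std::size_t entry_size) {
        if (entry_size > capacity_left) {
            return append_result::NO_SPACE;
        }
        capacity_left -= entry_size;
        return append_result::APPENDED;
    }
};

// Step 1: create the inode in memory, locked (may throw, but nothing is logged yet).
// Step 2: append to the log; on failure, back the in-memory inode out.
// Step 3: unlock and open the inode.
inline std::optional<inode_t> create_inode(std::map<inode_t, inode_entry>& inodes,
        fake_log& log, inode_t new_inode, std::size_t entry_size) {
    auto [it, inserted] = inodes.emplace(new_inode, inode_entry{true, 0});
    if (!inserted) {
        return std::nullopt; // inode number already in use
    }
    if (log.append(entry_size) != append_result::APPENDED) {
        inodes.erase(it); // back out: memory and log stay consistent
        return std::nullopt;
    }
    it->second.locked = false;
    ++it->second.opened_files_count;
    return new_inode;
}
```

In this ordering the log never contains an entry for an inode that is missing from memory.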

> + // We don't have to lock, as there was no context switch since the allocation of the inode number
> + ++new_inode_info.opened_files_count;

This will have to change in the future, since it allows someone in a
tight loop to exhaust the log. For example, consider creating and
deleting a file in a loop. So all metadata operations will have to wait
for log compaction to catch up (usually this will return immediately).

> + return make_ready_future<inode_t>(new_inode);

> + }
> + __builtin_unreachable();
> + }
> +

> +public:
> + static future<inode_t> perform(metadata_log& metadata_log, file_permissions perms) {
> + return do_with(create_and_open_unlinked_file_operation(metadata_log),
> + [perms = std::move(perms)](auto& obj) {
> + return obj.create_and_open_unlinked_file(std::move(perms));

> + });
> + }
> +};
> +

> +} // namespace seastar::fs
> diff --git a/src/fs/metadata_to_disk_buffer.hh b/src/fs/metadata_to_disk_buffer.hh

> index bd60f4f3..593ad46a 100644
> --- a/src/fs/metadata_to_disk_buffer.hh
> +++ b/src/fs/metadata_to_disk_buffer.hh
> @@ -152,6 +152,11 @@ class metadata_to_disk_buffer : protected to_disk_buffer {
> }
>
> public:
> + [[nodiscard]] virtual append_result append(const ondisk_create_inode& create_inode) noexcept {
> + // TODO: maybe add a constexpr static field to each ondisk_* entry specifying what type it is?
> + return append_simple(CREATE_INODE, create_inode);
> + }
> +
> using to_disk_buffer::flush_to_disk;
> };
>
> diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc
> index 6e29f2e5..be523fc7 100644
> --- a/src/fs/metadata_log.cc
> +++ b/src/fs/metadata_log.cc
> @@ -26,6 +26,7 @@
> #include "fs/metadata_disk_entries.hh"
> #include "fs/metadata_log.hh"
> #include "fs/metadata_log_bootstrap.hh"
> +#include "fs/metadata_log_operations/create_and_open_unlinked_file.hh"
> #include "fs/metadata_to_disk_buffer.hh"
> #include "fs/path.hh"
> #include "fs/units.hh"
> @@ -80,6 +81,22 @@ future<> metadata_log::shutdown() {
> });
> }
>
> +inode_info& metadata_log::memory_only_create_inode(inode_t inode, bool is_directory, unix_metadata metadata) {
> + assert(_inodes.count(inode) == 0);
> + return _inodes.emplace(inode, inode_info {
> + 0,
> + 0,
> + metadata,
> + [&]() -> decltype(inode_info::contents) {
> + if (is_directory) {
> + return inode_info::directory {};
> + }
> +
> + return inode_info::file {};
> + }()
> + }).first->second;
> +}
> +
> void metadata_log::schedule_flush_of_curr_cluster() {

> // Make writes concurrent (TODO: maybe serialized within *one* cluster would be faster?)

> schedule_background_task(do_with(_curr_cluster_buff, &_device, [](auto& crr_clstr_bf, auto& device) {

> @@ -213,6 +230,10 @@ file_offset_t metadata_log::file_size(inode_t inode) const {
> }, it->second.contents);
> }
>
> +future<inode_t> metadata_log::create_and_open_unlinked_file(file_permissions perms) {
> + return create_and_open_unlinked_file_operation::perform(*this, std::move(perms));
> +}
> +

> // TODO: think about how to make filesystem recoverable from ENOSPACE situation: flush() (or something else) throws ENOSPACE,

> // then it should be possible to compact some data (e.g. by truncating a file) via top-level interface and retrying the flush()

> // without a ENOSPACE error. In particular if we delete all files after ENOSPACE it should be successful. It becomes especially

> diff --git a/src/fs/metadata_log_bootstrap.cc b/src/fs/metadata_log_bootstrap.cc
> index 926d79fe..702e0e34 100644
> --- a/src/fs/metadata_log_bootstrap.cc
> +++ b/src/fs/metadata_log_bootstrap.cc
> @@ -211,6 +211,8 @@ future<> metadata_log_bootstrap::bootstrap_checkpointed_data() {
> return invalid_entry_exception();
> case NEXT_METADATA_CLUSTER:
> return bootstrap_next_metadata_cluster();
> + case CREATE_INODE:
> + return bootstrap_create_inode();

> }
>
> // unknown type => metadata log corruption

> @@ -242,6 +244,17 @@ bool metadata_log_bootstrap::inode_exists(inode_t inode) {
> return _metadata_log._inodes.count(inode) != 0;
> }
>
> +future<> metadata_log_bootstrap::bootstrap_create_inode() {
> + ondisk_create_inode entry;
> + if (not _curr_checkpoint.read_entry(entry) or inode_exists(entry.inode)) {

> + return invalid_entry_exception();
> + }
> +

> + _metadata_log.memory_only_create_inode(entry.inode, entry.is_directory,
> + ondisk_metadata_to_metadata(entry.metadata));

> + return now();
> +}
> +

> future<> metadata_log_bootstrap::bootstrap(metadata_log& metadata_log, inode_t root_dir, cluster_id_t first_metadata_cluster_id,

> cluster_range available_clusters, fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id) {

> // Clear the metadata log

> diff --git a/CMakeLists.txt b/CMakeLists.txt
> index 19666a8a..3304a02b 100644
> --- a/CMakeLists.txt
> +++ b/CMakeLists.txt
> @@ -677,6 +677,7 @@ if (Seastar_EXPERIMENTAL_FS)
> src/fs/metadata_log.hh
> src/fs/metadata_log_bootstrap.cc
> src/fs/metadata_log_bootstrap.hh
> + src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
> src/fs/metadata_to_disk_buffer.hh
> src/fs/path.hh
> src/fs/range.hh

Avi Kivity

<avi@scylladb.com>

Apr 22, 2020, 9:36:25 AM

to Krzysztof Małysa, seastar-dev@googlegroups.com, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

On 4/20/20 3:01 PM, Krzysztof Małysa wrote:
> Signed-off-by: Krzysztof Małysa <var...@gmail.com>
> ---

> src/fs/metadata_disk_entries.hh | 11 ++
> src/fs/metadata_log.hh | 8 +
> src/fs/metadata_log_bootstrap.hh | 2 +
> src/fs/metadata_log_operations/create_file.hh | 174 ++++++++++++++++++
> src/fs/metadata_to_disk_buffer.hh | 13 ++
> src/fs/metadata_log.cc | 24 +++
> src/fs/metadata_log_bootstrap.cc | 30 +++
> CMakeLists.txt | 1 +
> 8 files changed, 263 insertions(+)
> create mode 100644 src/fs/metadata_log_operations/create_file.hh
>
> diff --git a/src/fs/metadata_disk_entries.hh b/src/fs/metadata_disk_entries.hh
> index 437c2c2b..9c44b8cc 100644
> --- a/src/fs/metadata_disk_entries.hh
> +++ b/src/fs/metadata_disk_entries.hh
> @@ -73,6 +73,7 @@ enum ondisk_type : uint8_t {
> CHECKPOINT,
> NEXT_METADATA_CLUSTER,
> CREATE_INODE,
> + CREATE_INODE_AS_DIR_ENTRY,

Perhaps these should come with explicit initializers, so people
looking at on-disk data can easily find them, and to reduce the temptation
to insert in the middle.
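
For illustration, the enum with explicit initializers might read as follows. The values are the ones the enumerators already get implicitly in this patch; the static_assert lines are an extra, optional guard and not something the patch contains:

```cpp
#include <cassert>
#include <cstdint>

// Tag byte written in front of every on-disk entry. The numeric values are
// part of the on-disk format, so they are spelled out and must never change.
enum ondisk_type : uint8_t {
    INVALID = 0,
    CHECKPOINT = 1,
    NEXT_METADATA_CLUSTER = 2,
    CREATE_INODE = 3,
    CREATE_INODE_AS_DIR_ENTRY = 4,
};

// Optional guard: inserting a new enumerator in the middle breaks the build.
static_assert(CREATE_INODE == 3, "on-disk tag values must stay stable");
static_assert(CREATE_INODE_AS_DIR_ENTRY == 4, "on-disk tag values must stay stable");
```

New entry types would then be appended with the next free value.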

> };
>
> struct ondisk_checkpoint {
> @@ -102,11 +103,21 @@ struct ondisk_create_inode {
> ondisk_unix_metadata metadata;
> } __attribute__((packed));
>
> +struct ondisk_create_inode_as_dir_entry_header {
> + ondisk_create_inode entry_inode;
> + inode_t dir_inode;
> + uint16_t entry_name_length;
> + // After header comes entry name

> +} __attribute__((packed));
> +
> template<typename T>
> constexpr size_t ondisk_entry_size(const T& entry) noexcept {

> static_assert(std::is_same_v<T, ondisk_next_metadata_cluster> or

> std::is_same_v<T, ondisk_create_inode>, "ondisk entry size not defined for given type");
> return sizeof(ondisk_type) + sizeof(entry);
> }

> +constexpr size_t ondisk_entry_size(const ondisk_create_inode_as_dir_entry_header& entry) noexcept {
> + return sizeof(ondisk_type) + sizeof(entry) + entry.entry_name_length;

> +}
>
> } // namespace seastar::fs

> diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
> index 6f069c13..cc11b865 100644
> --- a/src/fs/metadata_log.hh
> +++ b/src/fs/metadata_log.hh
> @@ -157,6 +157,7 @@ class metadata_log {
> friend class metadata_log_bootstrap;
>
> friend class create_and_open_unlinked_file_operation;
> + friend class create_file_operation;

>
> public:
> metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment,

> @@ -179,6 +180,7 @@ class metadata_log {

> }
>
> inode_info& memory_only_create_inode(inode_t inode, bool is_directory, unix_metadata metadata);

> + void memory_only_add_dir_entry(inode_info::directory& dir, inode_t entry_inode, std::string entry_name);

>
> template<class Func>
> void schedule_background_task(Func&& task) {

> @@ -290,8 +292,14 @@ class metadata_log {

> // Returns size of the file or throws exception iff @p inode is invalid
> file_offset_t file_size(inode_t inode) const;
>

> + future<> create_file(std::string path, file_permissions perms);
> +
> + future<inode_t> create_and_open_file(std::string path, file_permissions perms);

> +
> future<inode_t> create_and_open_unlinked_file(file_permissions perms);
>

> + future<> create_directory(std::string path, file_permissions perms);

> +
> // All disk-related errors will be exposed here
> future<> flush_log() {
> return flush_curr_cluster();
> diff --git a/src/fs/metadata_log_bootstrap.hh b/src/fs/metadata_log_bootstrap.hh

> index 4a1fa7e9..d44c2f96 100644
> --- a/src/fs/metadata_log_bootstrap.hh
> +++ b/src/fs/metadata_log_bootstrap.hh
> @@ -117,6 +117,8 @@ class metadata_log_bootstrap {
>
> future<> bootstrap_create_inode();
>
> + future<> bootstrap_create_inode_as_dir_entry();

> +
> public:
> static future<> bootstrap(metadata_log& metadata_log, inode_t root_dir, cluster_id_t first_metadata_cluster_id,
> cluster_range available_clusters, fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id);

> diff --git a/src/fs/metadata_log_operations/create_file.hh b/src/fs/metadata_log_operations/create_file.hh
> new file mode 100644
> index 00000000..3ba83226
> --- /dev/null
> +++ b/src/fs/metadata_log_operations/create_file.hh
> @@ -0,0 +1,174 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*
> + * Copyright (C) 2020 ScyllaDB
> + */
> +
> +#pragma once
> +

> +#include "fs/metadata_log.hh"
> +#include "fs/path.hh"

> +#include "seastar/core/future.hh"
> +
> +namespace seastar::fs {
> +

> +enum class create_semantics {
> + CREATE_FILE,
> + CREATE_AND_OPEN_FILE,
> + CREATE_DIR,
> +};
> +
> +class create_file_operation {
> + metadata_log& _metadata_log;
> + create_semantics _create_semantics;
> + std::string _entry_name;
> + file_permissions _perms;
> + inode_t _dir_inode;
> + inode_info::directory* _dir_info;
> +
> + create_file_operation(metadata_log& metadata_log) : _metadata_log(metadata_log) {}
> +
> + future<inode_t> create_file(std::string path, file_permissions perms, create_semantics create_semantics) {
> + _create_semantics = create_semantics;
> + switch (create_semantics) {
> + case create_semantics::CREATE_FILE:
> + case create_semantics::CREATE_AND_OPEN_FILE:
> + break;
> + case create_semantics::CREATE_DIR:
> + while (not path.empty() and path.back() == '/') {
> + path.pop_back();
> + }
> + }
> +
> + _entry_name = extract_last_component(path);
> + if (_entry_name.empty()) {
> + return make_exception_future<inode_t>(invalid_path_exception());
> + }
> + assert(path.empty() or path.back() == '/'); // Hence fast-checking for "is directory" is done in path_lookup
> +
> + _perms = perms;
> + return _metadata_log.path_lookup(path).then([this](inode_t dir_inode) {
> + _dir_inode = dir_inode;
> + // Fail-fast checks before locking (as locking may be expensive)
> + auto dir_it = _metadata_log._inodes.find(_dir_inode);
> + if (dir_it == _metadata_log._inodes.end()) {
> + return make_exception_future<inode_t>(operation_became_invalid_exception());
> + }
> + assert(dir_it->second.is_directory() and "Directory cannot become file or there is a BUG in path_lookup");
> + _dir_info = &dir_it->second.get_directory();
> +
> + if (_dir_info->entries.count(_entry_name) != 0) {
> + return make_exception_future<inode_t>(file_already_exists_exception());
> + }
> +
> + return _metadata_log._locks.with_locks(metadata_log::locks::shared {dir_inode},
> + metadata_log::locks::unique {dir_inode, _entry_name}, [this] {
> + return create_file_in_directory();
> + });
> + });
> + }
> +
> + future<inode_t> create_file_in_directory() {
> + if (not _metadata_log.inode_exists(_dir_inode)) {
> + return make_exception_future<inode_t>(operation_became_invalid_exception());
> + }
> +
> + if (_dir_info->entries.count(_entry_name) != 0) {
> + return make_exception_future<inode_t>(file_already_exists_exception());
> + }
> +
> + ondisk_create_inode_as_dir_entry_header ondisk_entry;
> + decltype(ondisk_entry.entry_name_length) entry_name_length;
> + if (_entry_name.size() > std::numeric_limits<decltype(entry_name_length)>::max()) {
> + // TODO: add an assert that cluster_size is not too small, as that would cause all clusters to be allocated
> + // before an ENOSPACE error is returned
> + return make_exception_future<inode_t>(filename_too_long_exception());
> + }
> + entry_name_length = _entry_name.size();
> +

> + using namespace std::chrono;
> + uint64_t curr_time_ns = duration_cast<nanoseconds>(system_clock::now().time_since_epoch()).count();
> + unix_metadata unx_mtdt = {

> + _perms,

> + 0, // TODO: Eventually, we'll want a user to be able to pass his credentials when bootstrapping the
> + 0, // file system -- that will allow us to authorize users on startup (e.g. via LDAP or whatnot).
> + curr_time_ns,
> + curr_time_ns,
> + curr_time_ns
> + };
> +

> + bool creating_dir = [this] {
> + switch (_create_semantics) {
> + case create_semantics::CREATE_FILE:
> + case create_semantics::CREATE_AND_OPEN_FILE:
> + return false;
> + case create_semantics::CREATE_DIR:
> + return true;
> + }
> + __builtin_unreachable();
> + }();

> +
> + inode_t new_inode = _metadata_log._inode_allocator.alloc();
> +

> + ondisk_entry = {
> + {
> + new_inode,
> + creating_dir,
> + metadata_to_ondisk_metadata(unx_mtdt)
> + },
> + _dir_inode,
> + entry_name_length,
> + };
> +

C++20 added designated initializers, let's use them.
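To illustrate the suggestion, here is a minimal sketch of the same initialization written with designated initializers. The struct definitions are simplified stand-ins that only reuse the field names visible in the patch; the field types, the reduced `ondisk_unix_metadata`, and the `make_entry` helper are assumptions for illustration, not the actual SeastarFS definitions.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-ins for the on-disk structs (assumed layouts, illustration only).
using inode_t = uint64_t;

struct ondisk_unix_metadata {
    uint32_t perms;
    uint64_t mtime_ns;
};

struct ondisk_create_inode {
    inode_t inode;
    bool is_directory;
    ondisk_unix_metadata metadata;
};

struct ondisk_create_inode_as_dir_entry_header {
    ondisk_create_inode entry_inode;
    inode_t dir_inode;
    uint16_t entry_name_length;
};

// Hypothetical helper: builds the header with every field named at the point of use.
inline ondisk_create_inode_as_dir_entry_header make_entry(inode_t new_inode, bool creating_dir,
        ondisk_unix_metadata metadata, inode_t dir_inode, uint16_t entry_name_length) {
    // In C++20 the designators must follow the declaration order of the members.
    return {
        .entry_inode = {
            .inode = new_inode,
            .is_directory = creating_dir,
            .metadata = metadata,
        },
        .dir_inode = dir_inode,
        .entry_name_length = entry_name_length,
    };
}
```

Compared with the positional form, swapping `dir_inode` and `new_inode` by accident becomes visible at the call site, since both have the same type.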

> +
> + switch (_metadata_log.append_ondisk_entry(ondisk_entry, _entry_name.data())) {

> + case metadata_log::append_result::TOO_BIG:
> + return make_exception_future<inode_t>(cluster_size_too_small_to_perform_operation_exception());
> + case metadata_log::append_result::NO_SPACE:
> + return make_exception_future<inode_t>(no_more_space_exception());
> + case metadata_log::append_result::APPENDED:
> + inode_info& new_inode_info = _metadata_log.memory_only_create_inode(new_inode,

> + creating_dir, unx_mtdt);
> + _metadata_log.memory_only_add_dir_entry(*_dir_info, new_inode, std::move(_entry_name));
> +

Now we have the same rollback problem, only with two potential places to
roll back. So let's think of a strategy to do this. I guess installing
"tentative" entries and then either destroying them or marking them as
concrete when everything completes will work, but will require locking
the tentative entries (which I guess you're doing anyway).
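A rough sketch of the tentative-entry idea, under the assumption that a directory is an in-memory map from name to inode. The class `tentative_directory` and its methods are hypothetical names used only for illustration; locking of the tentative entry is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

using inode_t = uint64_t;

// Hypothetical in-memory directory whose entries start out tentative.
class tentative_directory {
    struct entry {
        inode_t inode;
        bool tentative;
    };
    std::map<std::string, entry> _entries;

public:
    // Installs a tentative entry; fails if the name is already taken
    // by a committed entry or by another tentative one.
    bool add_tentative(const std::string& name, inode_t inode) {
        return _entries.emplace(name, entry{inode, true}).second;
    }

    // Marks the entry as concrete once the whole operation has completed.
    void commit(const std::string& name) {
        auto it = _entries.find(name);
        assert(it != _entries.end() and it->second.tentative);
        it->second.tentative = false;
    }

    // Destroys a tentative entry when the operation failed part-way.
    void rollback(const std::string& name) {
        auto it = _entries.find(name);
        assert(it != _entries.end() and it->second.tentative);
        _entries.erase(it);
    }

    // Lookups only see committed entries.
    bool is_visible(const std::string& name) const {
        auto it = _entries.find(name);
        return it != _entries.end() and not it->second.tentative;
    }
};
```

A tentative entry reserves the name against concurrent creators but stays invisible to lookups, so a failure in either of the two steps can be undone with a single `rollback()`.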

> + switch (_create_semantics) {
> + case create_semantics::CREATE_FILE:
> + case create_semantics::CREATE_DIR:
> + break;
> + case create_semantics::CREATE_AND_OPEN_FILE:

> + // We don't have to lock, as there was no context switch since the allocation of the inode number
> + ++new_inode_info.opened_files_count;

> + break;
> + }
> +

> + return make_ready_future<inode_t>(new_inode);
> + }
> + __builtin_unreachable();
> + }
> +
> +public:

> + static future<inode_t> perform(metadata_log& metadata_log, std::string path, file_permissions perms,
> + create_semantics create_semantics) {
> + return do_with(create_file_operation(metadata_log),
> + [path = std::move(path), perms = std::move(perms), create_semantics](auto& obj) {
> + return obj.create_file(std::move(path), std::move(perms), create_semantics);

> + });
> + }
> +};
> +
> +} // namespace seastar::fs
> diff --git a/src/fs/metadata_to_disk_buffer.hh b/src/fs/metadata_to_disk_buffer.hh

> index 593ad46a..87b2bd8e 100644
> --- a/src/fs/metadata_to_disk_buffer.hh
> +++ b/src/fs/metadata_to_disk_buffer.hh
> @@ -157,6 +157,19 @@ class metadata_to_disk_buffer : protected to_disk_buffer {
> return append_simple(CREATE_INODE, create_inode);
> }
>
> + [[nodiscard]] virtual append_result append(const ondisk_create_inode_as_dir_entry_header& create_inode_as_dir_entry,
> + const void* entry_name) noexcept {
> + ondisk_type type = CREATE_INODE_AS_DIR_ENTRY;
> + if (not fits_for_append(ondisk_entry_size(create_inode_as_dir_entry))) {

> + return TOO_BIG;
> + }
> +
> + append_bytes(&type, sizeof(type));

> + append_bytes(&create_inode_as_dir_entry, sizeof(create_inode_as_dir_entry));
> + append_bytes(entry_name, create_inode_as_dir_entry.entry_name_length);
> + return APPENDED;

> + }
> +
> using to_disk_buffer::flush_to_disk;
> };
>
> diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc

> index be523fc7..d35d3710 100644
> --- a/src/fs/metadata_log.cc
> +++ b/src/fs/metadata_log.cc
> @@ -27,6 +27,7 @@
> #include "fs/metadata_log.hh"
> #include "fs/metadata_log_bootstrap.hh"
> #include "fs/metadata_log_operations/create_and_open_unlinked_file.hh"
> +#include "fs/metadata_log_operations/create_file.hh"

> #include "fs/metadata_to_disk_buffer.hh"
> #include "fs/path.hh"
> #include "fs/units.hh"

> @@ -97,6 +98,17 @@ inode_info& metadata_log::memory_only_create_inode(inode_t inode, bool is_direct
> }).first->second;
> }
>
> +void metadata_log::memory_only_add_dir_entry(inode_info::directory& dir, inode_t entry_inode, std::string entry_name) {
> + auto it = _inodes.find(entry_inode);
> + assert(it != _inodes.end());
> + // Directory may only be linked once (to avoid creating cycles)
> + assert(not it->second.is_directory() or not it->second.is_linked());
> +
> + bool inserted = dir.entries.emplace(std::move(entry_name), entry_inode).second;
> + assert(inserted);
> + ++it->second.directories_containing_file;

> +}
> +
> void metadata_log::schedule_flush_of_curr_cluster() {
> // Make writes concurrent (TODO: maybe serialized within *one* cluster would be faster?)
> schedule_background_task(do_with(_curr_cluster_buff, &_device, [](auto& crr_clstr_bf, auto& device) {

> @@ -230,10 +242,22 @@ file_offset_t metadata_log::file_size(inode_t inode) const {
> }, it->second.contents);
> }
>
> +future<> metadata_log::create_file(std::string path, file_permissions perms) {
> + return create_file_operation::perform(*this, std::move(path), std::move(perms), create_semantics::CREATE_FILE).discard_result();
> +}
> +
> +future<inode_t> metadata_log::create_and_open_file(std::string path, file_permissions perms) {
> + return create_file_operation::perform(*this, std::move(path), std::move(perms), create_semantics::CREATE_AND_OPEN_FILE);
> +}
> +
> future<inode_t> metadata_log::create_and_open_unlinked_file(file_permissions perms) {

> return create_and_open_unlinked_file_operation::perform(*this, std::move(perms));
> }
>

> +future<> metadata_log::create_directory(std::string path, file_permissions perms) {
> + return create_file_operation::perform(*this, std::move(path), std::move(perms), create_semantics::CREATE_DIR).discard_result();

> +}
> +
> // TODO: think about how to make filesystem recoverable from ENOSPACE situation: flush() (or something else) throws ENOSPACE,
> // then it should be possible to compact some data (e.g. by truncating a file) via top-level interface and retrying the flush()
> // without a ENOSPACE error. In particular if we delete all files after ENOSPACE it should be successful. It becomes especially
> diff --git a/src/fs/metadata_log_bootstrap.cc b/src/fs/metadata_log_bootstrap.cc

> index 702e0e34..01b567f0 100644
> --- a/src/fs/metadata_log_bootstrap.cc
> +++ b/src/fs/metadata_log_bootstrap.cc
> @@ -213,6 +213,8 @@ future<> metadata_log_bootstrap::bootstrap_checkpointed_data() {
> return bootstrap_next_metadata_cluster();
> case CREATE_INODE:
> return bootstrap_create_inode();
> + case CREATE_INODE_AS_DIR_ENTRY:
> + return bootstrap_create_inode_as_dir_entry();

> }
>
> // unknown type => metadata log corruption

> @@ -255,6 +257,34 @@ future<> metadata_log_bootstrap::bootstrap_create_inode() {
> return now();
> }
>
> +future<> metadata_log_bootstrap::bootstrap_create_inode_as_dir_entry() {
> + ondisk_create_inode_as_dir_entry_header entry;
> + if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.dir_inode) or
> + inode_exists(entry.entry_inode.inode)) {

> + return invalid_entry_exception();
> + }
> +

> + std::string dir_entry_name;
> + if (not _curr_checkpoint.read_string(dir_entry_name, entry.entry_name_length)) {

> + return invalid_entry_exception();
> + }
> +

> + if (not _metadata_log._inodes[entry.dir_inode].is_directory()) {
> + return invalid_entry_exception();
> + }
> + auto& dir = _metadata_log._inodes[entry.dir_inode].get_directory();
> +
> + if (dir.entries.count(dir_entry_name) != 0) {

> + return invalid_entry_exception();
> + }
> +

> + _metadata_log.memory_only_create_inode(entry.entry_inode.inode, entry.entry_inode.is_directory,
> + ondisk_metadata_to_metadata(entry.entry_inode.metadata));
> + _metadata_log.memory_only_add_dir_entry(dir, entry.entry_inode.inode, std::move(dir_entry_name));
> + // TODO: Maybe mtime_ns for modifying directory?

> + return now();
> +}
> +
> future<> metadata_log_bootstrap::bootstrap(metadata_log& metadata_log, inode_t root_dir, cluster_id_t first_metadata_cluster_id,
> cluster_range available_clusters, fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id) {
> // Clear the metadata log
> diff --git a/CMakeLists.txt b/CMakeLists.txt

> index 3304a02b..46cdf803 100644
> --- a/CMakeLists.txt
> +++ b/CMakeLists.txt
> @@ -678,6 +678,7 @@ if (Seastar_EXPERIMENTAL_FS)
> src/fs/metadata_log_bootstrap.cc
> src/fs/metadata_log_bootstrap.hh
> src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
> + src/fs/metadata_log_operations/create_file.hh
> src/fs/metadata_to_disk_buffer.hh
> src/fs/path.hh
> src/fs/range.hh

Avi Kivity

<avi@scylladb.com>

Apr 22, 2020, 9:39:52 AM

to Krzysztof Małysa, seastar-dev@googlegroups.com, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

On 4/20/20 3:01 PM, Krzysztof Małysa wrote:

> Some operations need to schedule deleting inode in the background. One
> of these is closing unlinked file if nobody else holds it open.

As I mentioned previously, it is fine to do work in the background, but
it has to be bounded. As soon as some limit is reached, new file
creation has to wait until background backlog drops.
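A minimal sketch of such a bound, written in plain C++ as a single-threaded stand-in. In Seastar itself this would presumably be built on a `seastar::semaphore` rather than a hand-written queue; the class `background_limiter` and its method names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical limiter: at most `limit` background tasks may be outstanding.
class background_limiter {
    size_t _limit;
    size_t _outstanding = 0;
    std::deque<std::function<void()>> _waiters;

public:
    explicit background_limiter(size_t limit) : _limit(limit) {}

    // Runs `start` immediately if below the limit, otherwise parks it
    // until the backlog drops.
    void admit(std::function<void()> start) {
        if (_outstanding < _limit) {
            ++_outstanding;
            start();
        } else {
            _waiters.push_back(std::move(start));
        }
    }

    // Called when a background task finishes; wakes one parked request, if any.
    void task_done() {
        assert(_outstanding > 0);
        --_outstanding;
        if (not _waiters.empty()) {
            auto next = std::move(_waiters.front());
            _waiters.pop_front();
            ++_outstanding;
            next();
        }
    }

    size_t outstanding() const noexcept { return _outstanding; }
    size_t waiting() const noexcept { return _waiters.size(); }
};
```

With a limit of 2, a third request is parked and only starts after one of the first two calls `task_done()`, which is the back-pressure described above.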

Avi Kivity

<avi@scylladb.com>

Apr 22, 2020, 9:42:48 AM

to Krzysztof Małysa, seastar-dev@googlegroups.com, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

On 4/20/20 3:01 PM, Krzysztof Małysa wrote:

> Allows the same file to be visible via different paths or to give a path
> to an unlinked file.

>
> Signed-off-by: Krzysztof Małysa <var...@gmail.com>
> ---
> src/fs/metadata_disk_entries.hh | 11 ++

> src/fs/metadata_log.hh | 7 ++
> src/fs/metadata_log_bootstrap.hh | 2 +
> src/fs/metadata_log_operations/link_file.hh | 112 ++++++++++++++++++++
> src/fs/metadata_to_disk_buffer.hh | 12 +++
> src/fs/metadata_log.cc | 11 ++
> src/fs/metadata_log_bootstrap.cc | 34 ++++++
> CMakeLists.txt | 1 +
> 8 files changed, 190 insertions(+)
> create mode 100644 src/fs/metadata_log_operations/link_file.hh
>
> diff --git a/src/fs/metadata_disk_entries.hh b/src/fs/metadata_disk_entries.hh
> index 310b1864..b81c25f5 100644
> --- a/src/fs/metadata_disk_entries.hh
> +++ b/src/fs/metadata_disk_entries.hh
> @@ -74,6 +74,7 @@ enum ondisk_type : uint8_t {
> NEXT_METADATA_CLUSTER,
> CREATE_INODE,
> DELETE_INODE,
> + ADD_DIR_ENTRY,
> CREATE_INODE_AS_DIR_ENTRY,
> };
>
> @@ -108,6 +109,13 @@ struct ondisk_delete_inode {
> inode_t inode;
> } __attribute__((packed));
>
> +struct ondisk_add_dir_entry_header {
> + inode_t dir_inode;
> + inode_t entry_inode;

> + uint16_t entry_name_length;
> + // After header comes entry name
> +} __attribute__((packed));
> +

> struct ondisk_create_inode_as_dir_entry_header {
> ondisk_create_inode entry_inode;
> inode_t dir_inode;
> @@ -122,6 +130,9 @@ constexpr size_t ondisk_entry_size(const T& entry) noexcept {
> std::is_same_v<T, ondisk_delete_inode>, "ondisk entry size not defined for given type");
> return sizeof(ondisk_type) + sizeof(entry);
> }
> +constexpr size_t ondisk_entry_size(const ondisk_add_dir_entry_header& entry) noexcept {

> + return sizeof(ondisk_type) + sizeof(entry) + entry.entry_name_length;
> +}
> constexpr size_t ondisk_entry_size(const ondisk_create_inode_as_dir_entry_header& entry) noexcept {
> return sizeof(ondisk_type) + sizeof(entry) + entry.entry_name_length;
> }

> diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
> index be5e843b..f5373458 100644
> --- a/src/fs/metadata_log.hh
> +++ b/src/fs/metadata_log.hh
> @@ -158,6 +158,7 @@ class metadata_log {
>
> friend class create_and_open_unlinked_file_operation;
> friend class create_file_operation;
> + friend class link_file_operation;

>
> public:
> metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment,

> @@ -303,6 +304,12 @@ class metadata_log {

>
> future<> create_directory(std::string path, file_permissions perms);
>

> + // Creates name (@p path) for a file (@p inode)
> + future<> link_file(inode_t inode, std::string path);
> +
> + // Creates name (@p destination) for a file (not directory) @p source
> + future<> link_file(std::string source, std::string destination);

> +
> // All disk-related errors will be exposed here
> future<> flush_log() {
> return flush_curr_cluster();
> diff --git a/src/fs/metadata_log_bootstrap.hh b/src/fs/metadata_log_bootstrap.hh

> index b28bce7f..3b38b232 100644
> --- a/src/fs/metadata_log_bootstrap.hh
> +++ b/src/fs/metadata_log_bootstrap.hh
> @@ -119,6 +119,8 @@ class metadata_log_bootstrap {
>
> future<> bootstrap_delete_inode();
>
> + future<> bootstrap_add_dir_entry();
> +
> future<> bootstrap_create_inode_as_dir_entry();
>
> public:
> diff --git a/src/fs/metadata_log_operations/link_file.hh b/src/fs/metadata_log_operations/link_file.hh
> new file mode 100644
> index 00000000..207fe327
> --- /dev/null
> +++ b/src/fs/metadata_log_operations/link_file.hh
> @@ -0,0 +1,112 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*
> + * Copyright (C) 2020 ScyllaDB
> + */
> +
> +#pragma once
> +

> +#include "fs/metadata_disk_entries.hh"
> +#include "fs/metadata_log.hh"
> +#include "fs/path.hh"

> +
> +namespace seastar::fs {
> +

> +class link_file_operation {
> + metadata_log& _metadata_log;
> + inode_t _src_inode;
> + std::string _entry_name;

> + inode_t _dir_inode;
> + inode_info::directory* _dir_info;
> +

> + link_file_operation(metadata_log& metadata_log) : _metadata_log(metadata_log) {}
> +
> + future<> link_file(inode_t inode, std::string path) {
> + _src_inode = inode;

> + _entry_name = extract_last_component(path);
> + if (_entry_name.empty()) {

> + return make_exception_future(is_directory_exception());

> + }
> + assert(path.empty() or path.back() == '/'); // Hence fast-checking for "is directory" is done in path_lookup
> +

> + return _metadata_log.path_lookup(path).then([this](inode_t dir_inode) {
> + _dir_inode = dir_inode;
> + // Fail-fast checks before locking (as locking may be expensive)

Locking is not _that_ expensive, and we usually expect the checks to
pass, so it's wasted work.

> + auto dir_it = _metadata_log._inodes.find(_dir_inode);
> + if (dir_it == _metadata_log._inodes.end()) {

> + return make_exception_future(operation_became_invalid_exception());

> + }
> + assert(dir_it->second.is_directory() and "Directory cannot become file or there is a BUG in path_lookup");
> + _dir_info = &dir_it->second.get_directory();
> +
> + if (_dir_info->entries.count(_entry_name) != 0) {

> + return make_exception_future(file_already_exists_exception());

> + }
> +
> + return _metadata_log._locks.with_locks(metadata_log::locks::shared {dir_inode},
> + metadata_log::locks::unique {dir_inode, _entry_name}, [this] {

> + return link_file_in_directory();

> + });
> + });
> + }
> +

> + future<> link_file_in_directory() {
> + if (not _metadata_log.inode_exists(_dir_inode)) {
> + return make_exception_future(operation_became_invalid_exception());

> + }
> +
> + if (_dir_info->entries.count(_entry_name) != 0) {

> + return make_exception_future(file_already_exists_exception());
> + }
> +
> + ondisk_add_dir_entry_header ondisk_entry;

> + decltype(ondisk_entry.entry_name_length) entry_name_length;
> + if (_entry_name.size() > std::numeric_limits<decltype(entry_name_length)>::max()) {
> + // TODO: add an assert that cluster_size is not too small, as that would cause all clusters to be allocated
> + // before an ENOSPACE error is returned

> + return make_exception_future(filename_too_long_exception());

> + }
> + entry_name_length = _entry_name.size();
> +

> + ondisk_entry = {
> + _dir_inode,
> + _src_inode,
> + entry_name_length,
> + };

> +
> + switch (_metadata_log.append_ondisk_entry(ondisk_entry, _entry_name.data())) {
> + case metadata_log::append_result::TOO_BIG:

> + return make_exception_future(cluster_size_too_small_to_perform_operation_exception());

> + case metadata_log::append_result::NO_SPACE:

> + return make_exception_future(no_more_space_exception());

> + case metadata_log::append_result::APPENDED:

> + _metadata_log.memory_only_add_dir_entry(*_dir_info, _src_inode, std::move(_entry_name));
> + return now();
> + }

> + __builtin_unreachable();
> + }
> +
> +public:

> + static future<> perform(metadata_log& metadata_log, inode_t inode, std::string path) {
> + return do_with(link_file_operation(metadata_log), [inode, path = std::move(path)](auto& obj) {
> + return obj.link_file(inode, std::move(path));

> + });
> + }
> +};
> +
> +} // namespace seastar::fs
> diff --git a/src/fs/metadata_to_disk_buffer.hh b/src/fs/metadata_to_disk_buffer.hh

> index 9eb1c538..38180224 100644
> --- a/src/fs/metadata_to_disk_buffer.hh
> +++ b/src/fs/metadata_to_disk_buffer.hh
> @@ -161,6 +161,18 @@ class metadata_to_disk_buffer : protected to_disk_buffer {
> return append_simple(DELETE_INODE, delete_inode);
> }
>
> + [[nodiscard]] virtual append_result append(const ondisk_add_dir_entry_header& add_dir_entry, const void* entry_name) noexcept {
> + ondisk_type type = ADD_DIR_ENTRY;
> + if (not fits_for_append(ondisk_entry_size(add_dir_entry))) {

> + return TOO_BIG;
> + }
> +
> + append_bytes(&type, sizeof(type));

> + append_bytes(&add_dir_entry, sizeof(add_dir_entry));
> + append_bytes(entry_name, add_dir_entry.entry_name_length);

> + return APPENDED;
> + }
> +

> [[nodiscard]] virtual append_result append(const ondisk_create_inode_as_dir_entry_header& create_inode_as_dir_entry,

> const void* entry_name) noexcept {
> ondisk_type type = CREATE_INODE_AS_DIR_ENTRY;
> diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc
> index 7f42f353..a8b17c2b 100644
> --- a/src/fs/metadata_log.cc
> +++ b/src/fs/metadata_log.cc
> @@ -28,6 +28,7 @@
> #include "fs/metadata_log_bootstrap.hh"
> #include "fs/metadata_log_operations/create_and_open_unlinked_file.hh"
> #include "fs/metadata_log_operations/create_file.hh"
> +#include "fs/metadata_log_operations/link_file.hh"

> #include "fs/metadata_to_disk_buffer.hh"
> #include "fs/path.hh"
> #include "fs/units.hh"

> @@ -296,6 +297,16 @@ future<> metadata_log::create_directory(std::string path, file_permissions perms

> return create_file_operation::perform(*this, std::move(path), std::move(perms), create_semantics::CREATE_DIR).discard_result();
> }
>

> +future<> metadata_log::link_file(inode_t inode, std::string path) {
> + return link_file_operation::perform(*this, inode, std::move(path));
> +}
> +
> +future<> metadata_log::link_file(std::string source, std::string destination) {
> + return path_lookup(std::move(source)).then([this, destination = std::move(destination)](inode_t inode) {
> + return link_file(inode, std::move(destination));
> + });

> +}
> +
> // TODO: think about how to make filesystem recoverable from ENOSPACE situation: flush() (or something else) throws ENOSPACE,
> // then it should be possible to compact some data (e.g. by truncating a file) via top-level interface and retrying the flush()
> // without a ENOSPACE error. In particular if we delete all files after ENOSPACE it should be successful. It becomes especially
> diff --git a/src/fs/metadata_log_bootstrap.cc b/src/fs/metadata_log_bootstrap.cc

> index 3058328a..64396d11 100644
> --- a/src/fs/metadata_log_bootstrap.cc
> +++ b/src/fs/metadata_log_bootstrap.cc
> @@ -215,6 +215,8 @@ future<> metadata_log_bootstrap::bootstrap_checkpointed_data() {
> return bootstrap_create_inode();
> case DELETE_INODE:
> return bootstrap_delete_inode();
> + case ADD_DIR_ENTRY:
> + return bootstrap_add_dir_entry();
> case CREATE_INODE_AS_DIR_ENTRY:
> return bootstrap_create_inode_as_dir_entry();
> }
> @@ -278,6 +280,38 @@ future<> metadata_log_bootstrap::bootstrap_delete_inode() {
> return now();
> }
>
> +future<> metadata_log_bootstrap::bootstrap_add_dir_entry() {
> + ondisk_add_dir_entry_header entry;

> + if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.dir_inode) or

> + not inode_exists(entry.entry_inode)) {

> + return invalid_entry_exception();
> + }
> +
> + std::string dir_entry_name;
> + if (not _curr_checkpoint.read_string(dir_entry_name, entry.entry_name_length)) {
> + return invalid_entry_exception();
> + }
> +

> + // Only files may be linked, so as not to create cycles (directories are created and linked using
> + // CREATE_INODE_AS_DIR_ENTRY)
> + if (not _metadata_log._inodes[entry.entry_inode].is_file()) {

> + return invalid_entry_exception();
> + }
> +
> + if (not _metadata_log._inodes[entry.dir_inode].is_directory()) {
> + return invalid_entry_exception();
> + }
> + auto& dir = _metadata_log._inodes[entry.dir_inode].get_directory();
> +
> + if (dir.entries.count(dir_entry_name) != 0) {
> + return invalid_entry_exception();
> + }
> +

> + _metadata_log.memory_only_add_dir_entry(dir, entry.entry_inode, std::move(dir_entry_name));

> + // TODO: Maybe mtime_ns for modifying directory?
> + return now();
> +}
> +

> future<> metadata_log_bootstrap::bootstrap_create_inode_as_dir_entry() {
> ondisk_create_inode_as_dir_entry_header entry;

> if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.dir_inode) or

> diff --git a/CMakeLists.txt b/CMakeLists.txt
> index 46cdf803..6259742e 100644
> --- a/CMakeLists.txt
> +++ b/CMakeLists.txt
> @@ -679,6 +679,7 @@ if (Seastar_EXPERIMENTAL_FS)
> src/fs/metadata_log_bootstrap.hh
> src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
> src/fs/metadata_log_operations/create_file.hh
> + src/fs/metadata_log_operations/link_file.hh
> src/fs/metadata_to_disk_buffer.hh
> src/fs/path.hh
> src/fs/range.hh

Avi Kivity

<avi@scylladb.com>

Apr 22, 2020, 10:20:48 AM

to Krzysztof Małysa, seastar-dev@googlegroups.com, Michał Niciejewski, sarna@scylladb.com, ankezy@gmail.com, wmitros@protonmail.com

On 4/20/20 3:02 PM, Krzysztof Małysa wrote:
> From: Michał Niciejewski <qup...@gmail.com>
>

> Each write can be divided into multiple smaller writes that can fall
> into one of the following categories:
> - small write: writes below SMALL_WRITE_THRESHOLD bytes, those writes
> are stored fully in memory
> - medium write: writes above SMALL_WRITE_THRESHOLD and below
> cluster_size bytes, those writes are stored on disk, they are appended
> to the on-disk data log where data from different writes can be stored
> in one cluster
> - big write: writes that fill exactly one cluster (length == cluster_size), stored on disk
> For example, one write can be divided into multiple big writes, some
> small writes and some medium writes. Current implementation won't make
> any unnecessary data copying. Data given by caller is either directly
> used to write to disk or is copied as a small write.
>
> Added cluster writer which is used to perform medium writes. Cluster
> writer keeps a current position in the data log and appends new data
> by writing it directly into disk.
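The division described above can be sketched as a greedy split: whole clusters first, then one remainder. This is a simplification under stated assumptions. It ignores the alignment constraints the real code has to respect, treats a remainder of exactly the threshold as a medium write, and the names `write_kind`, `write_part` and `split_write` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class write_kind { small, medium, big };

struct write_part {
    write_kind kind;
    size_t length;
};

// Hypothetical greedy split of a write of `length` bytes.
inline std::vector<write_part> split_write(size_t length, size_t cluster_size,
        size_t small_write_threshold) {
    assert(cluster_size > 0 and small_write_threshold <= cluster_size);
    std::vector<write_part> parts;
    // Every full cluster worth of data becomes a big write.
    while (length >= cluster_size) {
        parts.push_back({write_kind::big, cluster_size});
        length -= cluster_size;
    }
    if (length == 0) {
        return parts;
    }
    // The remainder is kept in memory if small, otherwise appended to the data log.
    parts.push_back({length < small_write_threshold ? write_kind::small : write_kind::medium, length});
    return parts;
}
```

For example, with a 4096-byte cluster and a 512-byte threshold, a write of 2 * 4096 + 1000 bytes splits into two big writes and one medium write of 1000 bytes.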

>
> Signed-off-by: Michał Niciejewski <qup...@gmail.com>
> ---

> src/fs/cluster_writer.hh | 85 +++++++
> src/fs/metadata_disk_entries.hh | 41 ++-
> src/fs/metadata_log.hh | 16 +-
> src/fs/metadata_log_bootstrap.hh | 8 +
> src/fs/metadata_log_operations/write.hh | 318 ++++++++++++++++++++++++
> src/fs/metadata_to_disk_buffer.hh | 24 ++
> src/fs/metadata_log.cc | 67 ++++-
> src/fs/metadata_log_bootstrap.cc | 103 ++++++++
> CMakeLists.txt | 2 +
> 9 files changed, 660 insertions(+), 4 deletions(-)
> create mode 100644 src/fs/cluster_writer.hh
> create mode 100644 src/fs/metadata_log_operations/write.hh
>
> diff --git a/src/fs/cluster_writer.hh b/src/fs/cluster_writer.hh
> new file mode 100644
> index 00000000..2d2ff917
> --- /dev/null
> +++ b/src/fs/cluster_writer.hh
> @@ -0,0 +1,85 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*
> + * Copyright (C) 2020 ScyllaDB
> + */
> +
> +#pragma once
> +

> +#include "fs/bitwise.hh"
> +#include "fs/units.hh"
> +#include "seastar/core/shared_ptr.hh"
> +#include "seastar/fs/block_device.hh"
> +
> +#include <cstdlib>
> +#include <cassert>

> +
> +namespace seastar::fs {
> +

> +// Represents buffer that will be written to a block_device. Method init() should be called just after constructor
> +// in order to finish construction.

> +class cluster_writer {
> +protected:

> + size_t _max_size = 0;
> + unit_size_t _alignment = 0;
> + disk_offset_t _cluster_beg_offset = 0;

> + size_t _next_write_offset = 0;
> +public:
> + cluster_writer() = default;
> +
> + virtual shared_ptr<cluster_writer> virtual_constructor() const {
> + return make_shared<cluster_writer>();
> + }

> +
> + // Total number of bytes appended cannot exceed @p aligned_max_size.
> + // @p cluster_beg_offset is the disk offset of the beginning of the cluster.
> + virtual void init(size_t aligned_max_size, unit_size_t alignment, disk_offset_t cluster_beg_offset) {
> + assert(is_power_of_2(alignment));
> + assert(mod_by_power_of_2(aligned_max_size, alignment) == 0);
> + assert(mod_by_power_of_2(cluster_beg_offset, alignment) == 0);
> +
> + _max_size = aligned_max_size;
> + _alignment = alignment;
> + _cluster_beg_offset = cluster_beg_offset;

> + _next_write_offset = 0;
> + }
> +
> + // Writes @p aligned_buffer to @p device just after previous write (or at @p cluster_beg_offset passed to init()
> + // if it is the first write).
> + virtual future<size_t> write(const void* aligned_buffer, size_t aligned_len, block_device device) {
> + assert(reinterpret_cast<uintptr_t>(aligned_buffer) % _alignment == 0);
> + assert(aligned_len % _alignment == 0);
> + assert(aligned_len <= bytes_left());
> +
> + // Make sure the writer is usable before returning from this function
> + size_t curr_write_offset = _next_write_offset;
> + _next_write_offset += aligned_len;
> +
> + return device.write(_cluster_beg_offset + curr_write_offset, aligned_buffer, aligned_len);
> + }
> +
> + virtual size_t bytes_left() const noexcept { return _max_size - _next_write_offset; }

> +
> + // Returns disk offset of the place where the first byte of next appended bytes would be after flush
> + // TODO: maybe better name for that function? Or any other method to extract that data?
> + virtual disk_offset_t current_disk_offset() const noexcept {

> + return _cluster_beg_offset + _next_write_offset;

> + }
> +};
> +
> +} // namespace seastar::fs

> diff --git a/src/fs/metadata_disk_entries.hh b/src/fs/metadata_disk_entries.hh
> index 2f363a9b..4422e0b1 100644
> --- a/src/fs/metadata_disk_entries.hh
> +++ b/src/fs/metadata_disk_entries.hh
> @@ -74,6 +74,10 @@ enum ondisk_type : uint8_t {
> NEXT_METADATA_CLUSTER,
> CREATE_INODE,
> DELETE_INODE,
> + SMALL_WRITE,
> + MEDIUM_WRITE,
> + LARGE_WRITE,
> + LARGE_WRITE_WITHOUT_MTIME,
> ADD_DIR_ENTRY,
> CREATE_INODE_AS_DIR_ENTRY,
> DELETE_DIR_ENTRY,
> @@ -111,6 +115,35 @@ struct ondisk_delete_inode {
> inode_t inode;
> } __attribute__((packed));
>
> +struct ondisk_small_write_header {
> + inode_t inode;
> + file_offset_t offset;
> + uint16_t length;
> + decltype(unix_metadata::mtime_ns) time_ns;
> + // After header comes data
> +} __attribute__((packed));
> +
> +struct ondisk_medium_write {
> + inode_t inode;
> + file_offset_t offset;
> + disk_offset_t disk_offset;
> + uint32_t length;
> + decltype(unix_metadata::mtime_ns) time_ns;
> +} __attribute__((packed));
> +
> +struct ondisk_large_write {
> + inode_t inode;
> + file_offset_t offset;
> + cluster_id_t data_cluster; // length == cluster_size
> + decltype(unix_metadata::mtime_ns) time_ns;
> +} __attribute__((packed));
> +

It's possible to have a small write into a data cluster. For example,
one creates a file, writes a bunch of stuff and it gets compacted into a
data cluster. Overwrites should then go into the data cluster rather
than the logs.

So the question is: should small/medium/large identify the size, or the
destination (metadata log, data log, clusters)? If the latter, then the
length can differ from the cluster size.
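
To make the second option concrete, here is a minimal sketch of an entry named after its destination that carries an explicit length. Everything in it is hypothetical: `ondisk_cluster_write`, `offset_in_cluster`, `fits_in_cluster`, and the fixed-width stand-ins for the patch's typedefs are assumptions for illustration, not part of the series.

```cpp
#include <cstdint>

// Stand-ins for the patch's typedefs; the real widths are an assumption here.
using inode_t = uint64_t;
using file_offset_t = uint64_t;
using cluster_id_t = uint64_t;

// Hypothetical entry: the name says where the data lives (a data cluster),
// and the length is stored explicitly instead of being implied.
struct ondisk_cluster_write {
    inode_t inode;
    file_offset_t offset;        // offset within the file
    cluster_id_t data_cluster;   // destination cluster
    uint32_t offset_in_cluster;  // where inside the cluster the overwrite starts
    uint32_t length;             // may be smaller than cluster_size
} __attribute__((packed));

// True if the described range stays inside a single cluster of the given size.
constexpr bool fits_in_cluster(const ondisk_cluster_write& e, uint32_t cluster_size) noexcept {
    return e.length <= cluster_size && e.offset_in_cluster <= cluster_size - e.length;
}
```

With this shape a short overwrite of already-compacted data can be described without pretending it is a full-cluster write.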

> +struct ondisk_large_write_without_mtime {
> + inode_t inode;
> + file_offset_t offset;
> + cluster_id_t data_cluster; // length == cluster_size
> +} __attribute__((packed));
> +
> struct ondisk_add_dir_entry_header {
> inode_t dir_inode;
> inode_t entry_inode;
> @@ -142,9 +175,15 @@ template<typename T>

> constexpr size_t ondisk_entry_size(const T& entry) noexcept {

> static_assert(std::is_same_v<T, ondisk_next_metadata_cluster> or

> std::is_same_v<T, ondisk_create_inode> or
> - std::is_same_v<T, ondisk_delete_inode>, "ondisk entry size not defined for given type");
> + std::is_same_v<T, ondisk_delete_inode> or
> + std::is_same_v<T, ondisk_medium_write> or
> + std::is_same_v<T, ondisk_large_write> or
> + std::is_same_v<T, ondisk_large_write_without_mtime>, "ondisk entry size not defined for given type");
> return sizeof(ondisk_type) + sizeof(entry);
> }
> +constexpr size_t ondisk_entry_size(const ondisk_small_write_header& entry) noexcept {
> + return sizeof(ondisk_type) + sizeof(entry) + entry.length;
> +}

> constexpr size_t ondisk_entry_size(const ondisk_add_dir_entry_header& entry) noexcept {

> return sizeof(ondisk_type) + sizeof(entry) + entry.entry_name_length;
> }
> diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh

> index 721e43b8..36e16280 100644
> --- a/src/fs/metadata_log.hh
> +++ b/src/fs/metadata_log.hh
> @@ -23,6 +23,7 @@
>
> #include "fs/cluster.hh"
> #include "fs/cluster_allocator.hh"
> +#include "fs/cluster_writer.hh"
> #include "fs/inode.hh"
> #include "fs/inode_info.hh"
> #include "fs/metadata_disk_entries.hh"
> @@ -54,6 +55,7 @@ class metadata_log {

>
> // Takes care of writing current cluster of serialized metadata log entries to device

> shared_ptr<metadata_to_disk_buffer> _curr_cluster_buff;
> + shared_ptr<cluster_writer> _curr_data_writer;
> shared_future<> _background_futures = now();
>
> // In memory metadata
> @@ -160,10 +162,11 @@ class metadata_log {
> friend class create_file_operation;
> friend class link_file_operation;
> friend class unlink_or_remove_file_operation;
> + friend class write_operation;

>
> public:
> metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment,

> - shared_ptr<metadata_to_disk_buffer> cluster_buff);
> + shared_ptr<metadata_to_disk_buffer> cluster_buff, shared_ptr<cluster_writer> data_writer);
>
> metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment);
>
> @@ -181,8 +184,16 @@ class metadata_log {
> return _inodes.count(inode) != 0;
> }
>
> + void write_update(inode_info::file& file, inode_data_vec new_data_vec);
> +
> + // Deletes data vectors that are subset of @p data_range and cuts overlapping data vectors to make them not overlap
> + void cut_out_data_range(inode_info::file& file, file_range range);
> +

> inode_info& memory_only_create_inode(inode_t inode, bool is_directory, unix_metadata metadata);

> void memory_only_delete_inode(inode_t inode);
> + void memory_only_small_write(inode_t inode, file_offset_t file_offset, temporary_buffer<uint8_t> data);
> + void memory_only_disk_write(inode_t inode, file_offset_t file_offset, disk_offset_t disk_offset, size_t write_len);
> + void memory_only_update_mtime(inode_t inode, decltype(unix_metadata::mtime_ns) mtime_ns);

> void memory_only_add_dir_entry(inode_info::directory& dir, inode_t entry_inode, std::string entry_name);

> void memory_only_delete_dir_entry(inode_info::directory& dir, std::string entry_name);
>
> @@ -324,6 +335,9 @@ class metadata_log {
>
> future<> close_file(inode_t inode);
>
> + future<size_t> write(inode_t inode, file_offset_t pos, const void* buffer, size_t len,
> + const io_priority_class& pc = default_priority_class());

> +
> // All disk-related errors will be exposed here
> future<> flush_log() {
> return flush_curr_cluster();
> diff --git a/src/fs/metadata_log_bootstrap.hh b/src/fs/metadata_log_bootstrap.hh

> index 16b429ab..03c2eb9b 100644
> --- a/src/fs/metadata_log_bootstrap.hh
> +++ b/src/fs/metadata_log_bootstrap.hh
> @@ -119,6 +119,14 @@ class metadata_log_bootstrap {
>
> future<> bootstrap_delete_inode();
>
> + future<> bootstrap_small_write();
> +
> + future<> bootstrap_medium_write();
> +
> + future<> bootstrap_large_write();
> +
> + future<> bootstrap_large_write_without_mtime();
> +
> future<> bootstrap_add_dir_entry();
>
> future<> bootstrap_create_inode_as_dir_entry();
> diff --git a/src/fs/metadata_log_operations/write.hh b/src/fs/metadata_log_operations/write.hh
> new file mode 100644
> index 00000000..afe3e2ae
> --- /dev/null
> +++ b/src/fs/metadata_log_operations/write.hh
> @@ -0,0 +1,318 @@

> +/*
> + * This file is open source software, licensed to you under the terms
> + * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
> + * distributed with this work for additional information regarding copyright
> + * ownership. You may not use this file except in compliance with the License.
> + *
> + * You may obtain a copy of the License at
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing,
> + * software distributed under the License is distributed on an
> + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
> + * KIND, either express or implied. See the License for the
> + * specific language governing permissions and limitations
> + * under the License.
> + */
> +/*
> + * Copyright (C) 2020 ScyllaDB
> + */
> +
> +#pragma once
> +

> +#include "fs/bitwise.hh"
> +#include "fs/inode.hh"
> +#include "fs/inode_info.hh"
> +#include "fs/metadata_disk_entries.hh"
> +#include "fs/metadata_log.hh"
> +#include "fs/units.hh"
> +#include "fs/cluster.hh"
> +#include "seastar/core/future-util.hh"
> +#include "seastar/core/future.hh"
> +#include "seastar/core/shared_ptr.hh"
> +#include "seastar/core/temporary_buffer.hh"

> +
> +namespace seastar::fs {
> +

> +class write_operation {
> +public:
> + // TODO: decide about threshold for small write
> + static constexpr size_t SMALL_WRITE_THRESHOLD = std::numeric_limits<decltype(ondisk_small_write_header::length)>::max();
> +
> +private:
> + metadata_log& _metadata_log;
> + inode_t _inode;
> + const io_priority_class& _pc;
> +
> + write_operation(metadata_log& metadata_log, inode_t inode, const io_priority_class& pc)
> + : _metadata_log(metadata_log), _inode(inode), _pc(pc) {
> + assert(_metadata_log._alignment <= SMALL_WRITE_THRESHOLD and
> + "Small write threshold should be at least as big as alignment");
> + }
> +
> + future<size_t> write(const uint8_t* buffer, size_t write_len, file_offset_t file_offset) {
> + auto inode_it = _metadata_log._inodes.find(_inode);
> + if (inode_it == _metadata_log._inodes.end()) {
> + return make_exception_future<size_t>(invalid_inode_exception());
> + }
> + if (inode_it->second.is_directory()) {
> + return make_exception_future<size_t>(is_directory_exception());
> + }
> +
> + // TODO: maybe check if there is enough free clusters before executing?
> + return _metadata_log._locks.with_lock(metadata_log::locks::shared {_inode}, [this, buffer, write_len, file_offset] {
> + if (not _metadata_log.inode_exists(_inode)) {
> + return make_exception_future<size_t>(operation_became_invalid_exception());
> + }
> + return iterate_writes(buffer, write_len, file_offset);
> + });
> + }
> +
> + future<size_t> iterate_writes(const uint8_t* buffer, size_t write_len, file_offset_t file_offset) {
> + return do_with((size_t)0, [this, buffer, write_len, file_offset](size_t& completed_write_len) {
> + return repeat([this, &completed_write_len, buffer, write_len, file_offset] {
> + if (completed_write_len == write_len) {
> + return make_ready_future<bool_class<stop_iteration_tag>>(stop_iteration::yes);
> + }
> +
> + size_t remaining_write_len = write_len - completed_write_len;
> +
> + size_t expected_write_len;
> + if (remaining_write_len <= SMALL_WRITE_THRESHOLD) {
> + expected_write_len = remaining_write_len;
> + } else {
> + if (auto buffer_alignment = mod_by_power_of_2(reinterpret_cast<uintptr_t>(buffer) + completed_write_len,
> + _metadata_log._alignment); buffer_alignment != 0) {
> + // When buffer is not aligned then align it using one small write
> + expected_write_len = _metadata_log._alignment - buffer_alignment;
> + } else {
> + if (remaining_write_len >= _metadata_log._cluster_size) {
> + expected_write_len = _metadata_log._cluster_size;
> + } else {
> + // If the last write is medium then align write length by splitting last write into medium aligned
> + // write and small write
> + expected_write_len = remaining_write_len;
> + }
> + }
> + }
> +
> + auto shifted_buffer = buffer + completed_write_len;
> + auto shifted_file_offset = file_offset + completed_write_len;
> + auto write_future = make_ready_future<size_t>(0);
> + if (expected_write_len <= SMALL_WRITE_THRESHOLD) {

This should also consider whether we already allocated a disk address,
not just the size. Consider someone issuing lots of small writes: the
in-memory metadata will explode and we will fail.

It's okay for now, but for the general case small writes are only good
if the file is small.
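
A rough sketch of such a policy, assuming the caller can tell whether the file already has on-disk data. The `write_path` enum, `choose_write_path`, and the `file_has_disk_data` flag are hypothetical names, and alignment handling is left out entirely.

```cpp
#include <cstddef>

enum class write_path { small_in_log, medium_in_data_log, large_in_cluster };

// Hypothetical policy: the thresholds and the `file_has_disk_data` flag are
// illustrative inputs, not fields from the patch.
constexpr write_path choose_write_path(std::size_t len, bool file_has_disk_data,
        std::size_t small_threshold, std::size_t cluster_size) noexcept {
    if (len >= cluster_size) {
        return write_path::large_in_cluster;
    }
    // Keep data in the metadata log only while the file itself is still small,
    // i.e. nothing of it has been placed on disk yet.
    if (len <= small_threshold && !file_has_disk_data) {
        return write_path::small_in_log;
    }
    return write_path::medium_in_data_log;
}
```

The point is only that the decision takes two inputs, the write size and the file's current placement, rather than the size alone.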

> + write_future = do_small_write(shifted_buffer, expected_write_len, shifted_file_offset);
> + } else if (expected_write_len < _metadata_log._cluster_size) {
> + write_future = medium_write(shifted_buffer, expected_write_len, shifted_file_offset);
> + } else {
> + // Update mtime only when it is the first write
> + write_future = do_large_write(shifted_buffer, shifted_file_offset, completed_write_len == 0);
> + }
> +
> + return write_future.then([&completed_write_len, expected_write_len](size_t write_len) {

> + completed_write_len += write_len;
> + if (write_len != expected_write_len) {
> + return stop_iteration::yes;
> + }

> + return stop_iteration::no;
> + });

> + }).then([&completed_write_len] {

This should be a then_wrapped, so if you have partial success, you can
return completed_write_len. Only if the very first write failed should
you propagate the exception.

> + return make_ready_future<size_t>(completed_write_len);

> + });
> + });
> + }
> +

> + static decltype(unix_metadata::mtime_ns) get_current_time_ns() {

> + using namespace std::chrono;

> + return duration_cast<nanoseconds>(system_clock::now().time_since_epoch()).count();
> + }
> +
> + future<size_t> do_small_write(const uint8_t* buffer, size_t expected_write_len, file_offset_t file_offset) {
> + auto curr_time_ns = get_current_time_ns();
> + ondisk_small_write_header ondisk_entry {
> + _inode,
> + file_offset,
> + static_cast<decltype(ondisk_small_write_header::length)>(expected_write_len),
> + curr_time_ns
> + };
> +
> + switch (_metadata_log.append_ondisk_entry(ondisk_entry, buffer)) {
> + case metadata_log::append_result::TOO_BIG:
> + return make_exception_future<size_t>(cluster_size_too_small_to_perform_operation_exception());
> + case metadata_log::append_result::NO_SPACE:
> + return make_exception_future<size_t>(no_more_space_exception());
> + case metadata_log::append_result::APPENDED:
> + temporary_buffer<uint8_t> tmp_buffer(buffer, expected_write_len);
> + _metadata_log.memory_only_small_write(_inode, file_offset, std::move(tmp_buffer));
> + _metadata_log.memory_only_update_mtime(_inode, curr_time_ns);

Again we need to handle failure.
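
One way to handle it is to do everything that can throw (mostly allocation) before the log is touched, so the commit step cannot fail halfway. The sketch below is a toy model under that assumption: `toy_log` and `small_write` are hypothetical, and plain vectors and a `std::map` stand in for the to-disk buffer and the in-memory data vectors.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct toy_log {
    std::vector<uint8_t> disk_entries;            // stand-in for the metadata log buffer
    std::map<uint64_t, std::vector<uint8_t>> mem; // stand-in for in-memory data, keyed by file offset
};

// Returns the number of bytes written. If an allocation fails, the exception
// leaves this function before `log` has been modified in any visible way.
inline std::size_t small_write(toy_log& log, uint64_t file_offset, const uint8_t* data, std::size_t len) {
    // Phase 1: everything that may throw, on objects the log does not see yet.
    std::vector<uint8_t> copy(data, data + len);
    std::map<uint64_t, std::vector<uint8_t>> staged;
    staged.emplace(file_offset, std::move(copy));
    log.disk_entries.reserve(log.disk_entries.size() + len);

    // Phase 2: commit. The capacity is reserved, so the append does not
    // reallocate, and moving a map node between maps does not allocate.
    const std::vector<uint8_t>& bytes = staged.begin()->second;
    log.disk_entries.insert(log.disk_entries.end(), bytes.begin(), bytes.end());
    log.mem.erase(file_offset); // drop data previously stored at this offset
    log.mem.insert(staged.extract(staged.begin()));
    return len;
}
```

The same split (prepare, then commit) should carry over to the real code if the `temporary_buffer` is created before `append_ondisk_entry()` is called.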

> + return make_ready_future<size_t>(expected_write_len);

> + }
> + __builtin_unreachable();
> + }
> +

> + future<size_t> medium_write(const uint8_t* aligned_buffer, size_t expected_write_len, file_offset_t file_offset) {
> + assert(reinterpret_cast<uintptr_t>(aligned_buffer) % _metadata_log._alignment == 0);
> + // TODO: medium write can be divided into bigger number of smaller writes. Maybe we should add checks
> + // for that and allow only limited number of medium writes? Or we could add to to_disk_buffer option for
> + // space 'reservation' to make sure that after division our write will fit into the buffer?
> + // That would also limit medium write to at most two smaller writes.
> + return do_with((size_t)0, [this, aligned_buffer, expected_write_len, file_offset](size_t& completed_write_len) {
> + return repeat([this, &completed_write_len, aligned_buffer, expected_write_len, file_offset] {
> + if (completed_write_len == expected_write_len) {
> + return make_ready_future<bool_class<stop_iteration_tag>>(stop_iteration::yes);
> + }
> +
> + size_t remaining_write_len = expected_write_len - completed_write_len;
> + size_t curr_expected_write_len;
> + auto shifted_buffer = aligned_buffer + completed_write_len;
> + auto shifted_file_offset = file_offset + completed_write_len;
> + auto write_future = make_ready_future<size_t>(0);
> + if (remaining_write_len <= SMALL_WRITE_THRESHOLD) {
> + // We can use small write for the remaining data
> + curr_expected_write_len = remaining_write_len;
> + write_future = do_small_write(shifted_buffer, curr_expected_write_len, shifted_file_offset);
> + } else {
> + size_t rounded_remaining_write_len =
> + round_down_to_multiple_of_power_of_2(remaining_write_len, _metadata_log._alignment);
> +
> + // We must use medium write
> + size_t buff_bytes_left = _metadata_log._curr_data_writer->bytes_left();
> + if (buff_bytes_left <= SMALL_WRITE_THRESHOLD) {
> + // TODO: add wasted buff_bytes_left bytes for compaction
> + // No space left in the current to_disk_buffer for medium write - allocate a new buffer
> + std::optional<cluster_id_t> cluster_opt = _metadata_log._cluster_allocator.alloc();
> + if (not cluster_opt) {
> + // TODO: maybe we should return partial write instead of exception?

Yes

> + return make_exception_future<bool_class<stop_iteration_tag>>(no_more_space_exception());
> + }
> +
> + auto cluster_id = cluster_opt.value();
> + disk_offset_t cluster_disk_offset = cluster_id_to_offset(cluster_id, _metadata_log._cluster_size);
> + _metadata_log._curr_data_writer = _metadata_log._curr_data_writer->virtual_constructor();
> + _metadata_log._curr_data_writer->init(_metadata_log._cluster_size, _metadata_log._alignment,
> + cluster_disk_offset);
> + buff_bytes_left = _metadata_log._curr_data_writer->bytes_left();
> +
> + curr_expected_write_len = rounded_remaining_write_len;
> + } else {
> + // There is enough space for medium write
> + curr_expected_write_len = buff_bytes_left >= rounded_remaining_write_len ?
> + rounded_remaining_write_len : buff_bytes_left;
> + }
> +
> + write_future = do_medium_write(shifted_buffer, curr_expected_write_len, shifted_file_offset,
> + _metadata_log._curr_data_writer);
> + }
> +
> + return write_future.then([&completed_write_len, curr_expected_write_len](size_t write_len) {
> + completed_write_len += write_len;
> + if (write_len != curr_expected_write_len) {
> + return stop_iteration::yes;
> + }

> + return stop_iteration::no;
> + });

> + }).then([&completed_write_len] {
> + return make_ready_future<size_t>(completed_write_len);

> + });
> + });
> + }
> +

> + future<size_t> do_medium_write(const uint8_t* aligned_buffer, size_t aligned_expected_write_len, file_offset_t file_offset,
> + shared_ptr<cluster_writer> disk_buffer) {
> + assert(reinterpret_cast<uintptr_t>(aligned_buffer) % _metadata_log._alignment == 0);
> + assert(aligned_expected_write_len % _metadata_log._alignment == 0);
> + assert(disk_buffer->bytes_left() >= aligned_expected_write_len);
> +
> + disk_offset_t device_offset = disk_buffer->current_disk_offset();
> + return disk_buffer->write(aligned_buffer, aligned_expected_write_len, _metadata_log._device).then(
> + [this, file_offset, disk_buffer = std::move(disk_buffer), device_offset](size_t write_len) {
> + // TODO: is this round down necessary?
> + // On partial write return aligned write length
> + write_len = round_down_to_multiple_of_power_of_2(write_len, _metadata_log._alignment);
> +
> + auto curr_time_ns = get_current_time_ns();
> + ondisk_medium_write ondisk_entry {
> + _inode,
> + file_offset,
> + device_offset,
> + static_cast<decltype(ondisk_medium_write::length)>(write_len),
> + curr_time_ns
> + };
> +
> + switch (_metadata_log.append_ondisk_entry(ondisk_entry)) {
> + case metadata_log::append_result::TOO_BIG:
> + return make_exception_future<size_t>(cluster_size_too_small_to_perform_operation_exception());
> + case metadata_log::append_result::NO_SPACE:
> + return make_exception_future<size_t>(no_more_space_exception());
> + case metadata_log::append_result::APPENDED:
> + _metadata_log.memory_only_disk_write(_inode, file_offset, device_offset, write_len);

Note we should update the data log fragmentation record here so we know
when to compact it.
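
As a rough illustration of what such a record could look like, here is a per-cluster counter of live bytes. `data_log_usage` and its methods are hypothetical; a real record would also need to know whether a cluster is still being filled, which this sketch ignores.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical fragmentation record for the data log: tracks how many bytes
// in each data cluster are still referenced by some file.
class data_log_usage {
    std::unordered_map<uint64_t, std::size_t> _live_bytes; // cluster id -> live bytes
    std::size_t _cluster_size;
public:
    explicit data_log_usage(std::size_t cluster_size) : _cluster_size(cluster_size) {}

    // Called when a medium write lands in the cluster.
    void on_write(uint64_t cluster_id, std::size_t len) { _live_bytes[cluster_id] += len; }

    // Called when previously written bytes are overwritten or cut out.
    void on_overwrite(uint64_t cluster_id, std::size_t len) {
        auto it = _live_bytes.find(cluster_id);
        if (it != _live_bytes.end()) {
            it->second -= std::min(len, it->second);
        }
    }

    // A cluster becomes a compaction candidate once its live share drops
    // below `min_live_percent` of the cluster size.
    bool should_compact(uint64_t cluster_id, std::size_t min_live_percent) const {
        auto it = _live_bytes.find(cluster_id);
        if (it == _live_bytes.end()) {
            return false;
        }
        return it->second * 100 < _cluster_size * min_live_percent;
    }
};
```

`on_write` would be called next to `memory_only_disk_write()`, and `on_overwrite` from the callback that `cut_out_data_range()` already passes for the compaction TODO.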

> + _metadata_log.memory_only_update_mtime(_inode, curr_time_ns);
> + return make_ready_future<size_t>(write_len);
> + }
> + __builtin_unreachable();
> + });
> + }
> +
> + future<size_t> do_large_write(const uint8_t* aligned_buffer, file_offset_t file_offset, bool update_mtime) {
> + assert(reinterpret_cast<uintptr_t>(aligned_buffer) % _metadata_log._alignment == 0);
> + // aligned_expected_write_len = _metadata_log._cluster_size
> + std::optional<cluster_id_t> cluster_opt = _metadata_log._cluster_allocator.alloc();
> + if (not cluster_opt) {
> + return make_exception_future<size_t>(no_more_space_exception());
> + }
> + auto cluster_id = cluster_opt.value();
> + disk_offset_t cluster_disk_offset = cluster_id_to_offset(cluster_id, _metadata_log._cluster_size);
> +
> + return _metadata_log._device.write(cluster_disk_offset, aligned_buffer, _metadata_log._cluster_size, _pc).then(
> + [this, file_offset, cluster_id, cluster_disk_offset, update_mtime](size_t write_len) {
> + if (write_len != _metadata_log._cluster_size) {
> + _metadata_log._cluster_allocator.free(cluster_id);
> + return make_ready_future<size_t>(0);
> + }
> +
> + metadata_log::append_result append_result;
> + if (update_mtime) {
> + auto curr_time_ns = get_current_time_ns();
> + ondisk_large_write ondisk_entry {
> + _inode,
> + file_offset,
> + cluster_id,
> + curr_time_ns
> + };
> + append_result = _metadata_log.append_ondisk_entry(ondisk_entry);
> + if (append_result == metadata_log::append_result::APPENDED) {
> + _metadata_log.memory_only_update_mtime(_inode, curr_time_ns);
> + }
> + } else {
> + ondisk_large_write_without_mtime ondisk_entry {
> + _inode,
> + file_offset,
> + cluster_id
> + };
> + append_result = _metadata_log.append_ondisk_entry(ondisk_entry);
> + }
> +
> + switch (append_result) {
> + case metadata_log::append_result::TOO_BIG:
> + return make_exception_future<size_t>(cluster_size_too_small_to_perform_operation_exception());
> + case metadata_log::append_result::NO_SPACE:
> + _metadata_log._cluster_allocator.free(cluster_id);
> + return make_exception_future<size_t>(no_more_space_exception());
> + case metadata_log::append_result::APPENDED:
> + _metadata_log.memory_only_disk_write(_inode, file_offset, cluster_disk_offset, write_len);
> + return make_ready_future<size_t>(write_len);
> + }
> + __builtin_unreachable();
> + });
> + }
> +
> +public:
> + static future<size_t> perform(metadata_log& metadata_log, inode_t inode, file_offset_t pos, const void* buffer,
> + size_t len, const io_priority_class& pc) {
> + return do_with(write_operation(metadata_log, inode, pc), [buffer, len, pos](auto& obj) {
> + return obj.write(static_cast<const uint8_t*>(buffer), len, pos);

> + });
> + }
> +};
> +
> +} // namespace seastar::fs
> diff --git a/src/fs/metadata_to_disk_buffer.hh b/src/fs/metadata_to_disk_buffer.hh

> index 979a03c2..6a71d96e 100644
> --- a/src/fs/metadata_to_disk_buffer.hh
> +++ b/src/fs/metadata_to_disk_buffer.hh
> @@ -161,6 +161,30 @@ class metadata_to_disk_buffer : protected to_disk_buffer {
> return append_simple(DELETE_INODE, delete_inode);
> }
>
> + [[nodiscard]] virtual append_result append(const ondisk_small_write_header& small_write, const void* data) noexcept {
> + ondisk_type type = SMALL_WRITE;
> + if (not fits_for_append(ondisk_entry_size(small_write))) {

> + return TOO_BIG;
> + }
> +
> + append_bytes(&type, sizeof(type));

> + append_bytes(&small_write, sizeof(small_write));
> + append_bytes(data, small_write.length);

> + return APPENDED;
> + }
> +

> + [[nodiscard]] virtual append_result append(const ondisk_medium_write& medium_write) noexcept {
> + return append_simple(MEDIUM_WRITE, medium_write);
> + }
> +
> + [[nodiscard]] virtual append_result append(const ondisk_large_write& large_write) noexcept {
> + return append_simple(LARGE_WRITE, large_write);
> + }
> +
> + [[nodiscard]] virtual append_result append(const ondisk_large_write_without_mtime& large_write_without_mtime) noexcept {
> + return append_simple(LARGE_WRITE_WITHOUT_MTIME, large_write_without_mtime);
> + }
> +

> [[nodiscard]] virtual append_result append(const ondisk_add_dir_entry_header& add_dir_entry, const void* entry_name) noexcept {

> ondisk_type type = ADD_DIR_ENTRY;
> if (not fits_for_append(ondisk_entry_size(add_dir_entry))) {
> diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc
> index 56954cf1..70434a4a 100644
> --- a/src/fs/metadata_log.cc
> +++ b/src/fs/metadata_log.cc
> @@ -30,6 +30,7 @@
> #include "fs/metadata_log_operations/create_file.hh"
> #include "fs/metadata_log_operations/link_file.hh"
> #include "fs/metadata_log_operations/unlink_or_remove_file.hh"
> +#include "fs/metadata_log_operations/write.hh"

> #include "fs/metadata_to_disk_buffer.hh"
> #include "fs/path.hh"
> #include "fs/units.hh"

> @@ -57,11 +58,12 @@
> namespace seastar::fs {

>
> metadata_log::metadata_log(block_device device, uint32_t cluster_size, uint32_t alignment,

> - shared_ptr<metadata_to_disk_buffer> cluster_buff)
> + shared_ptr<metadata_to_disk_buffer> cluster_buff, shared_ptr<cluster_writer> data_writer)
> : _device(std::move(device))
> , _cluster_size(cluster_size)
> , _alignment(alignment)
> , _curr_cluster_buff(std::move(cluster_buff))
> +, _curr_data_writer(std::move(data_writer))
> , _cluster_allocator({}, {})
> , _inode_allocator(1, 0) {
> assert(is_power_of_2(alignment));
> @@ -70,7 +72,7 @@ metadata_log::metadata_log(block_device device, uint32_t cluster_size, uint32_t
>
> metadata_log::metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment)
> : metadata_log(std::move(device), cluster_size, alignment,
> - make_shared<metadata_to_disk_buffer>()) {}
> + make_shared<metadata_to_disk_buffer>(), make_shared<cluster_writer>()) {}
>
> future<> metadata_log::bootstrap(inode_t root_dir, cluster_id_t first_metadata_cluster_id, cluster_range available_clusters,
> fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id) {
> @@ -84,6 +86,27 @@ future<> metadata_log::shutdown() {
> });
> }
>
> +void metadata_log::write_update(inode_info::file& file, inode_data_vec new_data_vec) {
> + // TODO: for compaction: update used inode_data_vec
> + auto file_size = file.size();
> + if (file_size < new_data_vec.data_range.beg) {
> + file.data.emplace(file_size, inode_data_vec {
> + {file_size, new_data_vec.data_range.beg},
> + inode_data_vec::hole_data {}
> + });
> + } else {
> + cut_out_data_range(file, new_data_vec.data_range);
> + }
> +
> + file.data.emplace(new_data_vec.data_range.beg, std::move(new_data_vec));
> +}
> +
> +void metadata_log::cut_out_data_range(inode_info::file& file, file_range range) {
> + file.cut_out_data_range(range, [](inode_data_vec data_vec) {
> + (void)data_vec; // TODO: for compaction: update used inode_data_vec
> + });
> +}
> +
> inode_info& metadata_log::memory_only_create_inode(inode_t inode, bool is_directory, unix_metadata metadata) {
> assert(_inodes.count(inode) == 0);
> return _inodes.emplace(inode, inode_info {
> @@ -118,6 +141,41 @@ void metadata_log::memory_only_delete_inode(inode_t inode) {
> _inodes.erase(it);
> }
>
> +void metadata_log::memory_only_small_write(inode_t inode, file_offset_t file_offset, temporary_buffer<uint8_t> data) {
> + inode_data_vec data_vec = {
> + {file_offset, file_offset + data.size()},
> + inode_data_vec::in_mem_data {std::move(data)}
> + };
> +
> + auto it = _inodes.find(inode);
> + assert(it != _inodes.end());
> + assert(it->second.is_file());
> + write_update(it->second.get_file(), std::move(data_vec));
> +}
> +
> +void metadata_log::memory_only_disk_write(inode_t inode, file_offset_t file_offset, disk_offset_t disk_offset,
> + size_t write_len) {
> + inode_data_vec data_vec = {
> + {file_offset, file_offset + write_len},
> + inode_data_vec::on_disk_data {disk_offset}
> + };
> +
> + auto it = _inodes.find(inode);
> + assert(it != _inodes.end());
> + assert(it->second.is_file());
> + write_update(it->second.get_file(), std::move(data_vec));
> +}
> +
> +void metadata_log::memory_only_update_mtime(inode_t inode, decltype(unix_metadata::mtime_ns) mtime_ns) {

> + auto it = _inodes.find(inode);

> + assert(it != _inodes.end());
> + it->second.metadata.mtime_ns = mtime_ns;
> + // ctime should be updated when contents is modified
> + if (it->second.metadata.ctime_ns < mtime_ns) {
> + it->second.metadata.ctime_ns = mtime_ns;
> + }
> +}
> +

> void metadata_log::memory_only_add_dir_entry(inode_info::directory& dir, inode_t entry_inode, std::string entry_name) {

> auto it = _inodes.find(entry_inode);
> assert(it != _inodes.end());
> @@ -381,6 +439,11 @@ future<> metadata_log::close_file(inode_t inode) {
> });
> }
>
> +future<size_t> metadata_log::write(inode_t inode, file_offset_t pos, const void* buffer, size_t len,
> + const io_priority_class& pc) {
> + return write_operation::perform(*this, inode, pos, buffer, len, pc);

> +}
> +
> // TODO: think about how to make filesystem recoverable from ENOSPACE situation: flush() (or something else) throws ENOSPACE,
> // then it should be possible to compact some data (e.g. by truncating a file) via top-level interface and retrying the flush()
> // without a ENOSPACE error. In particular if we delete all files after ENOSPACE it should be successful. It becomes especially
> diff --git a/src/fs/metadata_log_bootstrap.cc b/src/fs/metadata_log_bootstrap.cc

> index 3120fbd4..52354181 100644
> --- a/src/fs/metadata_log_bootstrap.cc
> +++ b/src/fs/metadata_log_bootstrap.cc
> @@ -111,8 +111,13 @@ future<> metadata_log_bootstrap::bootstrap(cluster_id_t first_metadata_cluster_i
> if (free_clusters.empty()) {
> return make_exception_future(no_more_space_exception());
> }
> + cluster_id_t datalog_cluster_id = free_clusters.front();
> free_clusters.pop_front();
>
> + _metadata_log._curr_data_writer = _metadata_log._curr_data_writer->virtual_constructor();
> + _metadata_log._curr_data_writer->init(_metadata_log._cluster_size, _metadata_log._alignment,
> + cluster_id_to_offset(datalog_cluster_id, _metadata_log._cluster_size));

> +
> mlogger.debug("free clusters: {}", free_clusters.size());

> _metadata_log._cluster_allocator = cluster_allocator(std::move(_taken_clusters), std::move(free_clusters));
>

> @@ -215,6 +220,14 @@ future<> metadata_log_bootstrap::bootstrap_checkpointed_data() {

> return bootstrap_create_inode();
> case DELETE_INODE:
> return bootstrap_delete_inode();

> + case SMALL_WRITE:
> + return bootstrap_small_write();
> + case MEDIUM_WRITE:
> + return bootstrap_medium_write();
> + case LARGE_WRITE:
> + return bootstrap_large_write();
> + case LARGE_WRITE_WITHOUT_MTIME:
> + return bootstrap_large_write_without_mtime();
> case ADD_DIR_ENTRY:
> return bootstrap_add_dir_entry();
> case CREATE_INODE_AS_DIR_ENTRY:
> @@ -284,6 +297,96 @@ future<> metadata_log_bootstrap::bootstrap_delete_inode() {
> return now();
> }
>
> +future<> metadata_log_bootstrap::bootstrap_small_write() {
> + ondisk_small_write_header entry;
> + if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.inode)) {
> + return invalid_entry_exception();
> + }
> +
> + if (not _metadata_log._inodes[entry.inode].is_file()) {
> + return invalid_entry_exception();
> + }
> +
> + auto data_opt = _curr_checkpoint.read_tmp_buff(entry.length);
> + if (not data_opt) {
> + return invalid_entry_exception();
> + }
> + temporary_buffer<uint8_t>& data = *data_opt;
> +
> + _metadata_log.memory_only_small_write(entry.inode, entry.offset, std::move(data));
> + _metadata_log.memory_only_update_mtime(entry.inode, entry.time_ns);
> + return now();
> +}
> +

> +future<> metadata_log_bootstrap::bootstrap_medium_write() {
> + ondisk_medium_write entry;
> + if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.inode)) {

> + return invalid_entry_exception();
> + }
> +

> + if (not _metadata_log._inodes[entry.inode].is_file()) {

> + return invalid_entry_exception();
> + }
> +

> + cluster_id_t data_cluster_id = offset_to_cluster_id(entry.disk_offset, _metadata_log._cluster_size);
> + if (_available_clusters.beg > data_cluster_id or
> + _available_clusters.end <= data_cluster_id) {
> + return invalid_entry_exception();
> + }
> + // TODO: we could check overlapping with other writes
> + _taken_clusters.emplace(data_cluster_id);
> +
> + _metadata_log.memory_only_disk_write(entry.inode, entry.offset, entry.disk_offset, entry.length);
> + _metadata_log.memory_only_update_mtime(entry.inode, entry.time_ns);

> + return now();
> +}
> +

> +future<> metadata_log_bootstrap::bootstrap_large_write() {
> + ondisk_large_write entry;
> + if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.inode)) {

> + return invalid_entry_exception();
> + }
> +

> + if (not _metadata_log._inodes[entry.inode].is_file()) {

> + return invalid_entry_exception();
> + }
> +

> + if (_available_clusters.beg > entry.data_cluster or
> + _available_clusters.end <= entry.data_cluster or
> + _taken_clusters.count(entry.data_cluster) != 0) {
> + return invalid_entry_exception();
> + }
> + _taken_clusters.emplace((cluster_id_t)entry.data_cluster);
> +
> + _metadata_log.memory_only_disk_write(entry.inode, entry.offset,
> + cluster_id_to_offset(entry.data_cluster, _metadata_log._cluster_size), _metadata_log._cluster_size);
> + _metadata_log.memory_only_update_mtime(entry.inode, entry.time_ns);

> + return now();
> +}
> +

> +// TODO: copy pasting :(
> +future<> metadata_log_bootstrap::bootstrap_large_write_without_mtime() {
> + ondisk_large_write_without_mtime entry;
> + if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.inode)) {

> + return invalid_entry_exception();
> + }
> +

> + if (not _metadata_log._inodes[entry.inode].is_file()) {

> + return invalid_entry_exception();
> + }
> +

> + if (_available_clusters.beg > entry.data_cluster or
> + _available_clusters.end <= entry.data_cluster or
> + _taken_clusters.count(entry.data_cluster) != 0) {
> + return invalid_entry_exception();
> + }
> + _taken_clusters.emplace((cluster_id_t)entry.data_cluster);
> +
> + _metadata_log.memory_only_disk_write(entry.inode, entry.offset,
> + cluster_id_to_offset(entry.data_cluster, _metadata_log._cluster_size), _metadata_log._cluster_size);

> + return now();
> +}
> +

> future<> metadata_log_bootstrap::bootstrap_add_dir_entry() {
> ondisk_add_dir_entry_header entry;
> if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.dir_inode) or
> diff --git a/CMakeLists.txt b/CMakeLists.txt
> index e432e572..840a02aa 100644
> --- a/CMakeLists.txt
> +++ b/CMakeLists.txt
> @@ -668,6 +668,7 @@ if (Seastar_EXPERIMENTAL_FS)
> src/fs/cluster.hh
> src/fs/cluster_allocator.cc
> src/fs/cluster_allocator.hh
> + src/fs/cluster_writer.hh
> src/fs/crc.hh
> src/fs/file.cc
> src/fs/inode.hh
> @@ -681,6 +682,7 @@ if (Seastar_EXPERIMENTAL_FS)
> src/fs/metadata_log_operations/create_file.hh
> src/fs/metadata_log_operations/link_file.hh
> src/fs/metadata_log_operations/unlink_or_remove_file.hh
> + src/fs/metadata_log_operations/write.hh
> src/fs/metadata_to_disk_buffer.hh
> src/fs/path.hh
> src/fs/range.hh

Avi Kivity

<avi@scylladb.com>

Apr 22, 2020, 10:23:51 AM

to Krzysztof Małysa, seastar-dev@googlegroups.com, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

On 4/20/20 3:01 PM, Krzysztof Małysa wrote:

Very good, looks promising.

Avi Kivity

<avi@scylladb.com>

Apr 23, 2020, 1:58:49 AM

to Krzysztof Małysa, seastar-dev@googlegroups.com, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

On 4/22/20 4:27 PM, Avi Kivity wrote:
>
>> + switch (_metadata_log.append_ondisk_entry(ondisk_entry)) {
>> + case metadata_log::append_result::TOO_BIG:
>> + return make_exception_future<inode_t>(cluster_size_too_small_to_perform_operation_exception());
>> + case metadata_log::append_result::NO_SPACE:
>> + return make_exception_future<inode_t>(no_more_space_exception());
>> + case metadata_log::append_result::APPENDED:
>> + inode_info& new_inode_info = _metadata_log.memory_only_create_inode(new_inode, false, unx_mtdt);
>
>
> What if this fails?
>
>
> Our options are:
>
> - first create the new inode in memory (but locked), then either
> unlock it or back it out
>
> - undo the append somehow

One way to implement undo/redo is to use C++17
std::map::extract/std::map::merge. You can extract the existing data and
move it to a temporary map, and on failure remove the new data and move
the old data back. Or alternatively, construct the new data in a
temporary map, and after that's successful, append the log, then splice
the new data into the main map.
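The second variant described above (build the new data in a temporary map, append to the log, then splice) could be sketched roughly as follows. This is only an illustration under stated assumptions: `inode_table`, `inode_info`, `create_inode` and the `append_to_log` callback are hypothetical stand-ins, not the types or functions from the patch, and the real code would work with futures rather than a synchronous callback.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-ins for the real SeastarFS types.
using inode_t = std::uint64_t;
struct inode_info { std::string debug_name; };
using inode_table = std::map<inode_t, inode_info>;

// Stage the new inode in a temporary map, append to the log, and only then
// splice the staged node into the main table. Returns false and leaves
// `inodes` untouched when the id is taken or the append is refused.
inline bool create_inode(inode_table& inodes, inode_t id, inode_info info,
                         const std::function<bool()>& append_to_log) {
    if (inodes.count(id) != 0) {
        return false;                     // id already in use
    }
    inode_table staged;
    staged.emplace(id, std::move(info));  // may throw; `inodes` not touched yet
    if (!append_to_log()) {
        return false;                     // `staged` is simply dropped
    }
    inodes.merge(staged);                 // relinks the node, no new allocation
    assert(staged.empty());               // the key was free, so it moved over
    return true;
}
```

The point of this ordering is that every step that can fail (the allocation in `emplace`, the log append) happens before the main map is modified, so there is nothing to undo on failure.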

Piotr Sarna

<sarna@scylladb.com>

Apr 23, 2020, 3:02:29 AM

to Avi Kivity, Krzysztof Małysa, seastar-dev@googlegroups.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com

Hm, I remember placing the exact same comment here in an earlier
iteration, maybe it slipped through... found it:
https://github.com/psarna/seastar/pull/61/files#r389360036 . I agree
that it can also be fixed as a TODO for now.

Krzysztof Małysa

<varqox@gmail.com>

May 17, 2020, 8:12:55 AM

to Avi Kivity, seastar-dev@googlegroups.com, Michał Niciejewski, Piotr Sarna, Aleksander Sorokin, Wojciech Mitros

On Wed, 22 Apr 2020 at 12:17, Avi Kivity <a...@scylladb.com> wrote:

> + static constexpr uint32_t max_shards_nb = 500;
> + static constexpr unit_size_t min_alignment = 4096;
> +
> + struct shard_info {
> + cluster_id_t metadata_cluster; /// cluster id of the first metadata log cluster
> + cluster_range available_clusters; /// range of clusters for data for this shard
> + };
> +
> + uint64_t version; /// file system version
> + unit_size_t alignment; /// write alignment in bytes
> + unit_size_t cluster_size; /// cluster size in bytes
> + inode_t root_directory; /// root dir inode number
> + std::vector<shard_info> shards_info; /// basic informations about each file system shard
> +

Please follow the all-or-nothing public member principle.

Could you provide the rationale behind this principle? It seems like a good opportunity to improve my general coding style.

Krzysztof Małysa

<varqox@gmail.com>

May 17, 2020, 8:15:30 AM

to Avi Kivity, seastar-dev@googlegroups.com, Piotr Sarna, Aleksander Sorokin, Michał Niciejewski, Wojciech Mitros

Sadly, std::filesystem::path is very inefficient at such simple tasks. Each path component is copied into a separate std::filesystem::path (which stores its value as a string), so decomposing a path costs at least as many allocations as it has components. It is this way because the C++ standard requires that the decomposition functions (the ones that return path components as std::filesystem::path) must not throw.

Given the above, and since the function is easy to write and is used fairly often, I chose to implement it myself.
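To illustrate the allocation argument, here is a minimal sketch of walking path components without copying them. It is not the function from the patch: `for_each_component` is a hypothetical helper, and it assumes '/' is the only separator and that empty components can be skipped.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: calls `visit` once per non-empty component of `path`.
// Each component is a std::string_view into the caller's buffer, so no
// per-component allocation or copy takes place.
template <typename Visitor>
void for_each_component(std::string_view path, Visitor&& visit) {
    std::size_t pos = 0;
    while (pos < path.size()) {
        const std::size_t next = path.find('/', pos);
        const std::size_t end =
            (next == std::string_view::npos) ? path.size() : next;
        if (end > pos) {                  // skip empty components such as "//"
            visit(path.substr(pos, end - pos));
        }
        pos = end + 1;                    // continue after the separator
    }
}
```

The views are only valid while the original string is alive, which is the trade-off for avoiding the copies.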

Avi Kivity

<avi@scylladb.com>

May 17, 2020, 8:32:09 AM

to Krzysztof Małysa, seastar-dev@googlegroups.com, Michał Niciejewski, Piotr Sarna, Aleksander Sorokin, Wojciech Mitros

A struct (plain public members, no methods) makes it clear that the user is invited to mutate it directly, that the members are orthogonal, and that sensible defaults are provided. See file_open_options for a typical example.

A class (private members, methods for everything) makes it clear that the class enforces some invariants.

A mixed layout invites confusion, especially as the class evolves.

In this case I think a struct is a good choice. You can drop the constructor in favor of C++20 designated initializers, and even default operator<=> to get the comparisons. The few other methods can be made statics or free functions.

Avi Kivity

<avi@scylladb.com>

May 17, 2020, 8:34:51 AM

to Krzysztof Małysa, seastar-dev@googlegroups.com, Piotr Sarna, Aleksander Sorokin, Michał Niciejewski, Wojciech Mitros

On 17/05/2020 15.15, Krzysztof Małysa wrote:

> Sadly, std::filesystem::path is very inefficient at such simple tasks. Each path component is copied into a separate std::filesystem::path (which stores its value as a string), so decomposing a path costs at least as many allocations as it has components. It is this way because the C++ standard requires that the decomposition functions (the ones that return path components as std::filesystem::path) must not throw.
>
> Given the above, and since the function is easy to write and is used fairly often, I chose to implement it myself.