[RFC PATCH 01/34] fs: prepare fs/ directory and conditional compilation


Krzysztof Małysa

<varqox@gmail.com>
Apr 20, 2020, 8:02:18 AM
to seastar-dev@googlegroups.com, Piotr Sarna, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com
From: Piotr Sarna <sa...@scylladb.com>

This patch provides the initial infrastructure for future
SeastarFS (Seastar filesystem) patches.
Since the project is at a very early stage and will require
C++17 features, it is disabled by default and can only be enabled
manually, either by configuring with --enable-experimental-fs
or by setting the CMake option -DSeastar_EXPERIMENTAL_FS=ON.
---
configure.py | 6 ++++++
CMakeLists.txt | 12 ++++++++++++
src/fs/README.md | 10 ++++++++++
3 files changed, 28 insertions(+)
create mode 100644 src/fs/README.md

diff --git a/configure.py b/configure.py
index bbc9f908..6ec350ef 100755
--- a/configure.py
+++ b/configure.py
@@ -106,6 +106,11 @@ add_tristate(
name = 'unused-result-error',
dest = "unused_result_error",
help = 'Make [[nodiscard]] violations an error')
+add_tristate(
+ arg_parser,
+ name = 'experimental-fs',
+ dest = "experimental_fs",
+ help = 'experimental support for SeastarFS')
arg_parser.add_argument('--allocator-page-size', dest='alloc_page_size', type=int, help='override allocator page size')
arg_parser.add_argument('--without-tests', dest='exclude_tests', action='store_true', help='Do not build tests by default')
arg_parser.add_argument('--without-apps', dest='exclude_apps', action='store_true', help='Do not build applications by default')
@@ -201,6 +206,7 @@ def configure_mode(mode):
tr(args.heap_profiling, 'HEAP_PROFILING'),
tr(args.coroutines_ts, 'EXPERIMENTAL_COROUTINES_TS'),
tr(args.unused_result_error, 'UNUSED_RESULT_ERROR'),
+ tr(args.experimental_fs, 'EXPERIMENTAL_FS'),
]

ingredients_to_cook = set(args.cook)
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 39ae46aa..be4f02c8 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -281,6 +281,10 @@ set (Seastar_STACK_GUARDS
STRING
"Enable stack guards. Can be ON, OFF or DEFAULT (which enables it for non release builds)")

+option (Seastar_EXPERIMENTAL_FS
+ "Compile experimental SeastarFS sources (requires C++17 support)"
+ OFF)
+
# When Seastar is embedded with `add_subdirectory`, disable the non-library targets.
if (NOT (CMAKE_CURRENT_SOURCE_DIR STREQUAL CMAKE_SOURCE_DIR))
set (Seastar_APPS OFF)
@@ -648,6 +652,14 @@ add_library (seastar STATIC

add_library (Seastar::seastar ALIAS seastar)

+if (Seastar_EXPERIMENTAL_FS)
+ message(STATUS "Experimental SeastarFS is enabled")
+ target_sources(seastar
+ PRIVATE
+ # SeastarFS source files
+ )
+endif()
+
add_dependencies (seastar
seastar_http_request_parser
seastar_http_response_parser
diff --git a/src/fs/README.md b/src/fs/README.md
new file mode 100644
index 00000000..630f34a8
--- /dev/null
+++ b/src/fs/README.md
@@ -0,0 +1,10 @@
+### SeastarFS ###
+
+SeastarFS is an R&D project aimed at providing a fully asynchronous,
+log-structured, shard-friendly file system optimized for large files
+and with native Seastar support.
+
+Source files residing in this directory will be compiled only
+by setting an appropriate flag.
+ninja: ./configure.py --enable-experimental-fs
+CMake: -DSeastar\_EXPERIMENTAL\_FS=ON
--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>
Apr 20, 2020, 8:02:18 AM
to seastar-dev@googlegroups.com, Krzysztof Małysa, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com
github: https://github.com/psarna/seastar/commits/fs-metadata-log

This series is part of the ZPP FS project that is coordinated by Piotr Sarna <sa...@scylladb.com>.
The goal of this project is to create SeastarFS -- a fully asynchronous, sharded, user-space,
log-structured file system that is intended to become an alternative to XFS for Scylla.

The filesystem is optimized for:
- NVMe SSD storage
- large files
- appending files

For efficiency, all metadata is kept in memory. The metadata records where each
part of a file is located and describes the directory tree.

The whole filesystem is divided into filesystem shards (typically as many as there are Seastar
shards, for efficiency). Each shard holds its own part of the filesystem. Sharding works by giving
each shard a set of root subdirectories that it exclusively owns (one of the shards also owns
the root directory itself).

Every shard will have 3 private logs:
- metadata log -- holds metadata and very small writes
- medium data log -- holds medium-sized writes, which can combine data from different files
- big data log -- holds data clusters, each of which belongs to a single file (this is not
actually a log, but in the big picture it behaves like one)
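
As a rough illustration of how a write might be routed to one of the three logs, here is a minimal standalone sketch. The names (`log_kind`, `log_thresholds`, `choose_log`) and both thresholds are assumptions made for this example; the cover letter does not say where "very small", "medium" and "big" writes begin, and the real code may split one write across several logs.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical classification of a write by its length.
enum class log_kind { metadata, medium_data, big_data };

struct log_thresholds {
    std::size_t small_write_max; // assumed: largest write stored inline in the metadata log
    std::size_t cluster_size;    // assumed: a write of at least this size fills a whole cluster
};

// Picks the log for a single write of write_len bytes.
// Simplification: the write is routed as a whole and never split.
inline log_kind choose_log(std::size_t write_len, const log_thresholds& t) {
    assert(t.small_write_max < t.cluster_size);
    if (write_len <= t.small_write_max) {
        return log_kind::metadata;
    }
    if (write_len < t.cluster_size) {
        return log_kind::medium_data;
    }
    return log_kind::big_data;
}
```

With the (made-up) thresholds of 512 bytes and 1 MiB, a 100-byte write would go to the metadata log, a 4 KiB write to the medium data log, and a 1 MiB write to the big data log.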

Disk space is divided into clusters (typically several MiB each) of
equal size, which is a multiple of the alignment (typically 4096
bytes). Each shard has its private pool of clusters (the assignment is
stored in the bootstrap record). Each log consumes clusters one by one: it
writes to the current cluster and, when that cluster becomes full, switches to
a new one obtained from the pool of free clusters managed by
cluster_allocator. The metadata log and the medium data log write data in the
same manner: they fill a cluster gradually from left to right. The big
data log takes a cluster and fills it completely with data at once; it
is used only for big writes.
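
The cluster-by-cluster consumption described above can be sketched in plain C++ as follows. This is not the cluster_allocator from this series: `free_cluster_pool`, `log_cursor` and their methods are hypothetical names, the code is synchronous, and switching clusters as soon as the next entry does not fit is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>
#include <unordered_set>
#include <utility>

using cluster_id = std::uint64_t; // hypothetical alias

// Pool of free clusters owned by one shard.
class free_cluster_pool {
    std::deque<cluster_id> _free;
    std::unordered_set<cluster_id> _allocated;
public:
    explicit free_cluster_pool(std::deque<cluster_id> owned) : _free(std::move(owned)) {}

    // Hands out the next free cluster, or nullopt when the pool is exhausted.
    std::optional<cluster_id> alloc() {
        if (_free.empty()) {
            return std::nullopt;
        }
        cluster_id id = _free.front();
        _free.pop_front();
        _allocated.insert(id);
        return id;
    }

    // Returns a previously allocated cluster to the pool.
    void release(cluster_id id) {
        assert(_allocated.count(id) == 1);
        _allocated.erase(id);
        _free.push_back(id);
    }
};

// Write position of a log that fills clusters from left to right.
class log_cursor {
    free_cluster_pool& _pool;
    std::size_t _cluster_size;
    std::optional<cluster_id> _current;
    std::size_t _offset = 0;
public:
    log_cursor(free_cluster_pool& pool, std::size_t cluster_size)
        : _pool(pool), _cluster_size(cluster_size) {}

    // Reserves len bytes and returns (cluster, offset within the cluster),
    // or nullopt if a new cluster is needed but none is free.
    std::optional<std::pair<cluster_id, std::size_t>> append(std::size_t len) {
        assert(len <= _cluster_size);
        if (!_current || _offset + len > _cluster_size) {
            auto next = _pool.alloc();
            if (!next) {
                return std::nullopt;
            }
            _current = next;
            _offset = 0;
        }
        auto pos = std::make_pair(*_current, _offset);
        _offset += len;
        return pos;
    }
};
```

A log built on a pool holding clusters 3 and 4 with 8192-byte clusters would place four 4096-byte entries at (3, 0), (3, 4096), (4, 0) and (4, 4096), and a fifth `append` would report that no cluster is left.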

metadata_log is in fact a standalone file system instance that provides a lower-level interface
(paths and inodes) to a shard's own part of the filesystem. It manages all 3 logs mentioned above and
maintains all metadata for its part of the file system, which includes data structures for the directory
structure and file content, locking logic for safe concurrent usage, buffers for writing logs to
disk, and bootstrapping -- restoring the file system structure from disk.

This patch implements:
- bootstrap record -- our equivalent of the filesystem superblock -- it contains information like
size of the block device, number of filesystem shards, cluster distribution among shards
- cluster allocator for managing free clusters within one metadata_log
- fully functional metadata_log that will be one shard's part of the filesystem
- bootstrapping metadata_log
- creating / deleting files and directories (+ support for unlinked files)
- reading, writing and truncating files
- opening and closing files
- linking files (but not directories)
- iterating directory and getting file attributes
- tests of some components and functionality of the metadata_log and bootstrap record

What is not here yet, but we plan to push later:
- compaction
- filesystem sharding
- renaming files

Tests: unit(dev)

Aleksander Sorokin (3):
fs: add initial file implementation
tests: fs: add parallel i/o unit test for seastarfs file
tests: fs: add basic test for metadata log bootstrapping

Krzysztof Małysa (14):
fs: add initial block_device implementation
fs: add temporary_file
tests: fs: add block_device unit test
fs: add unit headers
fs: add seastar/fs/overloaded.hh
fs: add seastar/fs/path.hh with unit tests
fs: add value_shared_lock.hh
fs: metadata_log: add base implementation
fs: metadata_log: add operation for creating and opening unlinked file
fs: metadata_log: add creating files and directories
fs: metadata_log: add private operation for deleting inode
fs: metadata_log: add link operation
fs: metadata_log: add unlinking files and removing directories
fs: metadata_log: add stat() operation

Michał Niciejewski (10):
fs: add bootstrap record implementation
tests: fs: add tests for bootstrap record
fs: metadata_log: add opening file
fs: metadata_log: add closing file
fs: metadata_log: add write operation
fs: metadata_log: add read operation
tests: fs: added metadata_to_disk_buffer and cluster_writer mockers
tests: fs: add write test
fs: read: add optimization for aligned reads
tests: fs: add tests for aligned reads and writes

Piotr Sarna (1):
fs: prepare fs/ directory and conditional compilation

Wojciech Mitros (6):
fs: add cluster allocator
fs: add cluster allocator tests
fs: metadata_log: add truncate operation
tests: fs: add to_disk_buffer test
tests: fs: add truncate operation test
tests: fs: add metadata_to_disk_buffer unit tests

configure.py | 6 +
include/seastar/fs/block_device.hh | 102 +++
include/seastar/fs/exceptions.hh | 88 ++
include/seastar/fs/file.hh | 55 ++
include/seastar/fs/overloaded.hh | 26 +
include/seastar/fs/stat.hh | 41 +
include/seastar/fs/temporary_file.hh | 54 ++
src/fs/bitwise.hh | 125 +++
src/fs/bootstrap_record.hh | 98 ++
src/fs/cluster.hh | 42 +
src/fs/cluster_allocator.hh | 50 ++
src/fs/cluster_writer.hh | 85 ++
src/fs/crc.hh | 34 +
src/fs/device_reader.hh | 91 ++
src/fs/inode.hh | 80 ++
src/fs/inode_info.hh | 221 +++++
src/fs/metadata_disk_entries.hh | 208 +++++
src/fs/metadata_log.hh | 362 ++++++++
src/fs/metadata_log_bootstrap.hh | 145 +++
.../create_and_open_unlinked_file.hh | 77 ++
src/fs/metadata_log_operations/create_file.hh | 174 ++++
src/fs/metadata_log_operations/link_file.hh | 112 +++
src/fs/metadata_log_operations/read.hh | 138 +++
src/fs/metadata_log_operations/truncate.hh | 90 ++
.../unlink_or_remove_file.hh | 196 ++++
src/fs/metadata_log_operations/write.hh | 318 +++++++
src/fs/metadata_to_disk_buffer.hh | 244 +++++
src/fs/path.hh | 42 +
src/fs/range.hh | 61 ++
src/fs/to_disk_buffer.hh | 138 +++
src/fs/units.hh | 40 +
src/fs/unix_metadata.hh | 40 +
src/fs/value_shared_lock.hh | 65 ++
tests/unit/fs_metadata_common.hh | 467 ++++++++++
tests/unit/fs_mock_block_device.hh | 55 ++
tests/unit/fs_mock_cluster_writer.hh | 78 ++
tests/unit/fs_mock_metadata_to_disk_buffer.hh | 323 +++++++
src/fs/bootstrap_record.cc | 206 +++++
src/fs/cluster_allocator.cc | 54 ++
src/fs/device_reader.cc | 199 +++++
src/fs/file.cc | 108 +++
src/fs/metadata_log.cc | 525 +++++++++++
src/fs/metadata_log_bootstrap.cc | 552 ++++++++++++
tests/unit/fs_block_device_test.cc | 206 +++++
tests/unit/fs_bootstrap_record_test.cc | 414 +++++++++
tests/unit/fs_cluster_allocator_test.cc | 115 +++
tests/unit/fs_log_bootstrap_test.cc | 86 ++
tests/unit/fs_metadata_to_disk_buffer_test.cc | 462 ++++++++++
tests/unit/fs_mock_block_device.cc | 50 ++
.../fs_mock_metadata_to_disk_buffer_test.cc | 357 ++++++++
tests/unit/fs_path_test.cc | 90 ++
tests/unit/fs_seastarfs_test.cc | 62 ++
tests/unit/fs_to_disk_buffer_test.cc | 160 ++++
tests/unit/fs_truncate_test.cc | 171 ++++
tests/unit/fs_write_test.cc | 835 ++++++++++++++++++
CMakeLists.txt | 50 ++
src/fs/README.md | 10 +
tests/unit/CMakeLists.txt | 42 +
58 files changed, 9325 insertions(+)
create mode 100644 include/seastar/fs/block_device.hh
create mode 100644 include/seastar/fs/exceptions.hh
create mode 100644 include/seastar/fs/file.hh
create mode 100644 include/seastar/fs/overloaded.hh
create mode 100644 include/seastar/fs/stat.hh
create mode 100644 include/seastar/fs/temporary_file.hh
create mode 100644 src/fs/bitwise.hh
create mode 100644 src/fs/bootstrap_record.hh
create mode 100644 src/fs/cluster.hh
create mode 100644 src/fs/cluster_allocator.hh
create mode 100644 src/fs/cluster_writer.hh
create mode 100644 src/fs/crc.hh
create mode 100644 src/fs/device_reader.hh
create mode 100644 src/fs/inode.hh
create mode 100644 src/fs/inode_info.hh
create mode 100644 src/fs/metadata_disk_entries.hh
create mode 100644 src/fs/metadata_log.hh
create mode 100644 src/fs/metadata_log_bootstrap.hh
create mode 100644 src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
create mode 100644 src/fs/metadata_log_operations/create_file.hh
create mode 100644 src/fs/metadata_log_operations/link_file.hh
create mode 100644 src/fs/metadata_log_operations/read.hh
create mode 100644 src/fs/metadata_log_operations/truncate.hh
create mode 100644 src/fs/metadata_log_operations/unlink_or_remove_file.hh
create mode 100644 src/fs/metadata_log_operations/write.hh
create mode 100644 src/fs/metadata_to_disk_buffer.hh
create mode 100644 src/fs/path.hh
create mode 100644 src/fs/range.hh
create mode 100644 src/fs/to_disk_buffer.hh
create mode 100644 src/fs/units.hh
create mode 100644 src/fs/unix_metadata.hh
create mode 100644 src/fs/value_shared_lock.hh
create mode 100644 tests/unit/fs_metadata_common.hh
create mode 100644 tests/unit/fs_mock_block_device.hh
create mode 100644 tests/unit/fs_mock_cluster_writer.hh
create mode 100644 tests/unit/fs_mock_metadata_to_disk_buffer.hh
create mode 100644 src/fs/bootstrap_record.cc
create mode 100644 src/fs/cluster_allocator.cc
create mode 100644 src/fs/device_reader.cc
create mode 100644 src/fs/file.cc
create mode 100644 src/fs/metadata_log.cc
create mode 100644 src/fs/metadata_log_bootstrap.cc
create mode 100644 tests/unit/fs_block_device_test.cc
create mode 100644 tests/unit/fs_bootstrap_record_test.cc
create mode 100644 tests/unit/fs_cluster_allocator_test.cc
create mode 100644 tests/unit/fs_log_bootstrap_test.cc
create mode 100644 tests/unit/fs_metadata_to_disk_buffer_test.cc
create mode 100644 tests/unit/fs_mock_block_device.cc
create mode 100644 tests/unit/fs_mock_metadata_to_disk_buffer_test.cc
create mode 100644 tests/unit/fs_path_test.cc
create mode 100644 tests/unit/fs_seastarfs_test.cc
create mode 100644 tests/unit/fs_to_disk_buffer_test.cc
create mode 100644 tests/unit/fs_truncate_test.cc
create mode 100644 tests/unit/fs_write_test.cc
create mode 100644 src/fs/README.md

--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>
Apr 20, 2020, 8:02:19 AM
to seastar-dev@googlegroups.com, Krzysztof Małysa, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com
block_device is an abstraction over an opened block device file or
an opened ordinary file of fixed size. It offers:
- opening and closing the file (block device)
- aligned reads and writes
- flushing
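
To make the "aligned reads and writes" contract concrete, here is a small synchronous, in-memory stand-in. It is only a sketch: `memory_block_device` is a hypothetical class and not part of this patch, it returns plain sizes instead of futures, and it checks only the position and length (real DMA I/O also requires the buffer address itself to be aligned).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical in-memory device that rejects unaligned accesses.
class memory_block_device {
    std::vector<unsigned char> _data;
    std::size_t _alignment;

    // Throws unless pos and len are multiples of the alignment and in range.
    void check(std::uint64_t pos, std::size_t len) const {
        if (pos % _alignment != 0 || len % _alignment != 0) {
            throw std::invalid_argument("unaligned access");
        }
        if (pos > _data.size() || len > _data.size() - pos) {
            throw std::out_of_range("access past end of device");
        }
    }
public:
    memory_block_device(std::size_t size, std::size_t alignment)
        : _data(size), _alignment(alignment) {
        assert(alignment > 0 && size % alignment == 0);
    }

    // Copies len bytes from buf to the device at pos; returns bytes written.
    std::size_t write(std::uint64_t pos, const void* buf, std::size_t len) {
        check(pos, len);
        std::memcpy(_data.data() + pos, buf, len);
        return len;
    }

    // Copies len bytes from the device at pos into buf; returns bytes read.
    std::size_t read(std::uint64_t pos, void* buf, std::size_t len) const {
        check(pos, len);
        std::memcpy(buf, _data.data() + pos, len);
        return len;
    }
};
```

Writing a 4096-byte buffer at offset 4096 and reading it back yields the same bytes, while a write at offset 100 is rejected as unaligned.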

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
include/seastar/fs/block_device.hh | 102 +++++++++++++++++++++++++++++
CMakeLists.txt | 1 +
2 files changed, 103 insertions(+)
create mode 100644 include/seastar/fs/block_device.hh

diff --git a/include/seastar/fs/block_device.hh b/include/seastar/fs/block_device.hh
new file mode 100644
index 00000000..31822617
--- /dev/null
+++ b/include/seastar/fs/block_device.hh
@@ -0,0 +1,102 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include "seastar/core/file.hh"
+#include "seastar/core/reactor.hh"
+
+namespace seastar::fs {
+
+class block_device_impl {
+public:
+ virtual ~block_device_impl() = default;
+
+ virtual future<size_t> write(uint64_t aligned_pos, const void* aligned_buffer, size_t aligned_len, const io_priority_class& pc) = 0;
+ virtual future<size_t> read(uint64_t aligned_pos, void* aligned_buffer, size_t aligned_len, const io_priority_class& pc) = 0;
+ virtual future<> flush() = 0;
+ virtual future<> close() = 0;
+};
+
+class block_device {
+ shared_ptr<block_device_impl> _block_device_impl;
+public:
+ block_device(shared_ptr<block_device_impl> impl) noexcept : _block_device_impl(std::move(impl)) {}
+
+ block_device() = default;
+
+ block_device(const block_device&) = default;
+ block_device(block_device&&) noexcept = default;
+ block_device& operator=(const block_device&) noexcept = default;
+ block_device& operator=(block_device&&) noexcept = default;
+
+ explicit operator bool() const noexcept { return bool(_block_device_impl); }
+
+ template <typename CharType>
+ future<size_t> read(uint64_t aligned_offset, CharType* aligned_buffer, size_t aligned_len, const io_priority_class& pc = default_priority_class()) {
+ return _block_device_impl->read(aligned_offset, aligned_buffer, aligned_len, pc);
+ }
+
+ template <typename CharType>
+ future<size_t> write(uint64_t aligned_offset, const CharType* aligned_buffer, size_t aligned_len, const io_priority_class& pc = default_priority_class()) {
+ return _block_device_impl->write(aligned_offset, aligned_buffer, aligned_len, pc);
+ }
+
+ future<> flush() {
+ return _block_device_impl->flush();
+ }
+
+ future<> close() {
+ return _block_device_impl->close();
+ }
+};
+
+class file_block_device_impl : public block_device_impl {
+ file _file;
+public:
+ explicit file_block_device_impl(file f) : _file(std::move(f)) {}
+
+ ~file_block_device_impl() override = default;
+
+ future<size_t> write(uint64_t aligned_pos, const void* aligned_buffer, size_t aligned_len, const io_priority_class& pc) override {
+ return _file.dma_write(aligned_pos, aligned_buffer, aligned_len, pc);
+ }
+
+ future<size_t> read(uint64_t aligned_pos, void* aligned_buffer, size_t aligned_len, const io_priority_class& pc) override {
+ return _file.dma_read(aligned_pos, aligned_buffer, aligned_len, pc);
+ }
+
+ future<> flush() override {
+ return _file.flush();
+ }
+
+ future<> close() override {
+ return _file.close();
+ }
+};
+
+inline future<block_device> open_block_device(std::string name) {
+ return open_file_dma(std::move(name), open_flags::rw).then([](file f) {
+ return block_device(make_shared<file_block_device_impl>(std::move(f)));
+ });
+}
+
+}
diff --git a/CMakeLists.txt b/CMakeLists.txt
index be4f02c8..b50abf99 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -657,6 +657,7 @@ if (Seastar_EXPERIMENTAL_FS)
target_sources(seastar
PRIVATE
# SeastarFS source files
+ include/seastar/fs/block_device.hh
)
endif()

--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>
Apr 20, 2020, 8:02:21 AM
to seastar-dev@googlegroups.com, Krzysztof Małysa, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com
temporary_file is a handle to a temporary file with a path.
It creates a temporary file upon construction and removes it upon
destruction.
The main use case is testing the file system on a temporary file that
simulates a block device.
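
For the "temporary file that simulates a block device" use case, a test may also want the file to have a fixed size. The following standalone sketch shows one way to do that with POSIX calls; `scoped_device_file` is a hypothetical name, and sizing the file with `ftruncate` is an assumption of this example rather than something temporary_file in this patch does.

```cpp
#include <cassert>
#include <stdexcept>
#include <stdlib.h>    // mkstemp
#include <string>
#include <sys/stat.h>
#include <sys/types.h> // off_t
#include <unistd.h>    // close, ftruncate, unlink
#include <utility>

// Hypothetical RAII helper: creates "<prefix>.XXXXXX" with the given size
// and removes it again when the object goes out of scope.
class scoped_device_file {
    std::string _path;
public:
    scoped_device_file(std::string prefix, off_t size)
        : _path(std::move(prefix) + ".XXXXXX") {
        assert(size >= 0);
        int fd = ::mkstemp(&_path[0]); // replaces XXXXXX in place
        if (fd == -1) {
            throw std::runtime_error("mkstemp failed");
        }
        int rc = ::ftruncate(fd, size); // extend the empty file to the device size
        ::close(fd);
        if (rc == -1) {
            ::unlink(_path.c_str());
            throw std::runtime_error("ftruncate failed");
        }
    }

    ~scoped_device_file() {
        ::unlink(_path.c_str());
    }

    scoped_device_file(const scoped_device_file&) = delete;
    scoped_device_file& operator=(const scoped_device_file&) = delete;

    const std::string& path() const noexcept {
        return _path;
    }
};
```

While the object is alive, `path()` names an existing file of the requested size; once it is destroyed, the file is gone.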

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
include/seastar/fs/temporary_file.hh | 54 ++++++++++++++++++++++++++++
CMakeLists.txt | 1 +
2 files changed, 55 insertions(+)
create mode 100644 include/seastar/fs/temporary_file.hh

diff --git a/include/seastar/fs/temporary_file.hh b/include/seastar/fs/temporary_file.hh
new file mode 100644
index 00000000..c00282d9
--- /dev/null
+++ b/include/seastar/fs/temporary_file.hh
@@ -0,0 +1,54 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+
+#include "seastar/core/posix.hh"
+
+#include <string>
+
+namespace seastar::fs {
+
+class temporary_file {
+ std::string _path;
+
+public:
+ explicit temporary_file(std::string path) : _path(std::move(path) + ".XXXXXX") {
+ int fd = mkstemp(_path.data());
+ throw_system_error_on(fd == -1);
+ close(fd);
+ }
+
+ ~temporary_file() {
+ unlink(_path.data());
+ }
+
+ temporary_file(const temporary_file&) = delete;
+ temporary_file& operator=(const temporary_file&) = delete;
+ temporary_file(temporary_file&&) noexcept = delete;
+ temporary_file& operator=(temporary_file&&) noexcept = delete;
+
+ const std::string& path() const noexcept {
+ return _path;
+ }
+};
+
+} // namespace seastar::fs
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 0ba7ee35..39d11ad8 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -659,6 +659,7 @@ if (Seastar_EXPERIMENTAL_FS)
# SeastarFS source files
include/seastar/fs/block_device.hh
include/seastar/fs/file.hh
+ include/seastar/fs/temporary_file.hh
src/fs/file.cc
)
endif()
--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>
Apr 20, 2020, 8:02:21 AM
to seastar-dev@googlegroups.com, Aleksander Sorokin, sarna@scylladb.com, quport@gmail.com, wmitros@protonmail.com
From: Aleksander Sorokin <ank...@gmail.com>

Currently the only implementation of Seastar's file abstraction is
`posix_file_impl`. This patch provides another implementation, which keeps
a reference to our file system's metadata and uses the `block_device`
handle underneath. It adds `seastarfs_file_impl`, which derives from
`file_impl` and provides a stub interface. At the moment it is extremely
oversimplified and just treats the whole block device as one huge file.
The patch also provides a free function for creating this handle.

Signed-off-by: Aleksander Sorokin <ank...@gmail.com>
---
include/seastar/fs/file.hh | 55 +++++++++++++++++++
src/fs/file.cc | 108 +++++++++++++++++++++++++++++++++++++
CMakeLists.txt | 2 +
3 files changed, 165 insertions(+)
create mode 100644 include/seastar/fs/file.hh
create mode 100644 src/fs/file.cc

diff --git a/include/seastar/fs/file.hh b/include/seastar/fs/file.hh
new file mode 100644
index 00000000..ae38f3a4
--- /dev/null
+++ b/include/seastar/fs/file.hh
@@ -0,0 +1,55 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include "seastar/core/file.hh"
+#include "seastar/core/future.hh"
+#include "seastar/fs/block_device.hh"
+
+namespace seastar::fs {
+
+class seastarfs_file_impl : public file_impl {
+ block_device _block_device;
+ open_flags _open_flags;
+public:
+ seastarfs_file_impl(block_device dev, open_flags flags);
+ ~seastarfs_file_impl() override = default;
+
+ future<size_t> write_dma(uint64_t pos, const void* buffer, size_t len, const io_priority_class& pc) override;
+ future<size_t> write_dma(uint64_t pos, std::vector<iovec> iov, const io_priority_class& pc) override;
+ future<size_t> read_dma(uint64_t pos, void* buffer, size_t len, const io_priority_class& pc) override;
+ future<size_t> read_dma(uint64_t pos, std::vector<iovec> iov, const io_priority_class& pc) override;
+ future<> flush() override;
+ future<struct stat> stat() override;
+ future<> truncate(uint64_t length) override;
+ future<> discard(uint64_t offset, uint64_t length) override;
+ future<> allocate(uint64_t position, uint64_t length) override;
+ future<uint64_t> size() override;
+ future<> close() noexcept override;
+ std::unique_ptr<file_handle_impl> dup() override;
+ subscription<directory_entry> list_directory(std::function<future<> (directory_entry de)> next) override;
+ future<temporary_buffer<uint8_t>> dma_read_bulk(uint64_t offset, size_t range_size, const io_priority_class& pc) override;
+};
+
+future<file> open_file_dma(std::string name, open_flags flags);
+
+}
diff --git a/src/fs/file.cc b/src/fs/file.cc
new file mode 100644
index 00000000..4f4e0ac6
--- /dev/null
+++ b/src/fs/file.cc
@@ -0,0 +1,108 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#include "seastar/core/future.hh"
+#include "seastar/fs/block_device.hh"
+#include "seastar/fs/file.hh"
+
+namespace seastar::fs {
+
+seastarfs_file_impl::seastarfs_file_impl(block_device dev, open_flags flags)
+ : _block_device(std::move(dev))
+ , _open_flags(flags) {}
+
+future<size_t>
+seastarfs_file_impl::write_dma(uint64_t pos, const void* buffer, size_t len, const io_priority_class& pc) {
+ return _block_device.write(pos, buffer, len, pc);
+}
+
+future<size_t>
+seastarfs_file_impl::write_dma(uint64_t pos, std::vector<iovec> iov, const io_priority_class& pc) {
+ throw std::bad_function_call();
+}
+
+future<size_t>
+seastarfs_file_impl::read_dma(uint64_t pos, void* buffer, size_t len, const io_priority_class& pc) {
+ return _block_device.read(pos, buffer, len, pc);
+}
+
+future<size_t>
+seastarfs_file_impl::read_dma(uint64_t pos, std::vector<iovec> iov, const io_priority_class& pc) {
+ throw std::bad_function_call();
+}
+
+future<>
+seastarfs_file_impl::flush() {
+ return _block_device.flush();
+}
+
+future<struct stat>
+seastarfs_file_impl::stat() {
+ throw std::bad_function_call();
+}
+
+future<>
+seastarfs_file_impl::truncate(uint64_t) {
+ throw std::bad_function_call();
+}
+
+future<>
+seastarfs_file_impl::discard(uint64_t offset, uint64_t length) {
+ throw std::bad_function_call();
+}
+
+future<>
+seastarfs_file_impl::allocate(uint64_t position, uint64_t length) {
+ throw std::bad_function_call();
+}
+
+future<uint64_t>
+seastarfs_file_impl::size() {
+ throw std::bad_function_call();
+}
+
+future<>
+seastarfs_file_impl::close() noexcept {
+ return _block_device.close();
+}
+
+std::unique_ptr<file_handle_impl>
+seastarfs_file_impl::dup() {
+ throw std::bad_function_call();
+}
+
+subscription<directory_entry>
+seastarfs_file_impl::list_directory(std::function<future<> (directory_entry de)> next) {
+ throw std::bad_function_call();
+}
+
+future<temporary_buffer<uint8_t>>
+seastarfs_file_impl::dma_read_bulk(uint64_t offset, size_t range_size, const io_priority_class& pc) {
+ throw std::bad_function_call();
+}
+
+future<file> open_file_dma(std::string name, open_flags flags) {
+ return open_block_device(std::move(name)).then([flags] (block_device bd) {
+ return file(make_shared<seastarfs_file_impl>(std::move(bd), flags));
+ });
+}
+
+}
diff --git a/CMakeLists.txt b/CMakeLists.txt
index b50abf99..0ba7ee35 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -658,6 +658,8 @@ if (Seastar_EXPERIMENTAL_FS)
PRIVATE
# SeastarFS source files
include/seastar/fs/block_device.hh
+ include/seastar/fs/file.hh
+ src/fs/file.cc
)
endif()

--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>
Apr 20, 2020, 8:02:22 AM
to seastar-dev@googlegroups.com, Aleksander Sorokin, sarna@scylladb.com, quport@gmail.com, wmitros@protonmail.com
From: Aleksander Sorokin <ank...@gmail.com>

Added a first crude unit test for seastarfs_file_impl:
parallel writing with a newly created handle and reading the same data back.

Signed-off-by: Aleksander Sorokin <ank...@gmail.com>
---
tests/unit/fs_seastarfs_test.cc | 62 +++++++++++++++++++++++++++++++++
tests/unit/CMakeLists.txt | 5 +++
2 files changed, 67 insertions(+)
create mode 100644 tests/unit/fs_seastarfs_test.cc

diff --git a/tests/unit/fs_seastarfs_test.cc b/tests/unit/fs_seastarfs_test.cc
new file mode 100644
index 00000000..25c3e8d5
--- /dev/null
+++ b/tests/unit/fs_seastarfs_test.cc
@@ -0,0 +1,62 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#include "seastar/core/aligned_buffer.hh"
+#include "seastar/core/file-types.hh"
+#include "seastar/core/file.hh"
+#include "seastar/core/thread.hh"
+#include "seastar/core/units.hh"
+#include "seastar/fs/file.hh"
+#include "seastar/fs/temporary_file.hh"
+#include "seastar/testing/thread_test_case.hh"
+
+using namespace seastar;
+using namespace fs;
+
+constexpr auto device_path = "/tmp/seastarfs";
+constexpr auto device_size = 16 * MB;
+
+SEASTAR_THREAD_TEST_CASE(parallel_read_write_test) {
+ const auto tf = temporary_file(device_path);
+ auto f = fs::open_file_dma(tf.path(), open_flags::rw).get0();
+ static auto alignment = f.memory_dma_alignment();
+
+ parallel_for_each(boost::irange<off_t>(0, device_size / alignment), [&f](auto i) {
+ auto wbuf = allocate_aligned_buffer<unsigned char>(alignment, alignment);
+ std::fill(wbuf.get(), wbuf.get() + alignment, i);
+ auto wb = wbuf.get();
+
+ return f.dma_write(i * alignment, wb, alignment).then(
+ [&f, i, wbuf = std::move(wbuf)](auto ret) mutable {
+ BOOST_REQUIRE_EQUAL(ret, alignment);
+ auto rbuf = allocate_aligned_buffer<unsigned char>(alignment, alignment);
+ auto rb = rbuf.get();
+ return f.dma_read(i * alignment, rb, alignment).then(
+ [f, rbuf = std::move(rbuf), wbuf = std::move(wbuf)](auto ret) {
+ BOOST_REQUIRE_EQUAL(ret, alignment);
+ BOOST_REQUIRE(std::equal(rbuf.get(), rbuf.get() + alignment, wbuf.get()));
+ });
+ });
+ }).wait();
+
+ f.flush().wait();
+ f.close().wait();
+}
diff --git a/tests/unit/CMakeLists.txt b/tests/unit/CMakeLists.txt
index 8f203721..f2c5187f 100644
--- a/tests/unit/CMakeLists.txt
+++ b/tests/unit/CMakeLists.txt
@@ -361,6 +361,11 @@ seastar_add_test (rpc
loopback_socket.hh
rpc_test.cc)

+if (Seastar_EXPERIMENTAL_FS)
+ seastar_add_test (fs_seastarfs
+ SOURCES fs_seastarfs_test.cc)
+endif()
+
seastar_add_test (semaphore
SOURCES semaphore_test.cc)

--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>
Apr 20, 2020, 8:02:23 AM
to seastar-dev@googlegroups.com, Krzysztof Małysa, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com
What is tested:
- simple reads and writes
- parallel non-overlapping writes then parallel non-overlapping reads
- random and simultaneous reads and writes

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
tests/unit/fs_block_device_test.cc | 206 +++++++++++++++++++++++++++++
tests/unit/CMakeLists.txt | 3 +
2 files changed, 209 insertions(+)
create mode 100644 tests/unit/fs_block_device_test.cc

diff --git a/tests/unit/fs_block_device_test.cc b/tests/unit/fs_block_device_test.cc
new file mode 100644
index 00000000..6887005c
--- /dev/null
+++ b/tests/unit/fs_block_device_test.cc
@@ -0,0 +1,206 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#include "seastar/core/do_with.hh"
+#include "seastar/core/future-util.hh"
+#include "seastar/core/temporary_buffer.hh"
+
+#include <boost/range/irange.hpp>
+#include <random>
+#include <seastar/core/app-template.hh>
+#include <seastar/core/thread.hh>
+#include <seastar/core/units.hh>
+#include <seastar/fs/block_device.hh>
+#include <seastar/fs/temporary_file.hh>
+#include <seastar/testing/test_runner.hh>
+#include <unistd.h>
+
+using namespace seastar;
+using namespace seastar::fs;
+
+constexpr off_t min_device_size = 16*MB;
+constexpr size_t alignment = 4*KB;
+
+static future<temporary_buffer<char>> allocate_random_aligned_buffer(size_t size) {
+ return do_with(temporary_buffer<char>::aligned(alignment, size),
+ std::default_random_engine(testing::local_random_engine()), [size](auto& buffer, auto& random_engine) {
+ return do_for_each(buffer.get_write(), buffer.get_write() + size, [&](char& c) {
+ std::uniform_int_distribution<> character(0, (1 << (sizeof(char) * 8)) - 1); // full byte range, not just 0..7
+ c = character(random_engine);
+ }).then([&buffer] {
+ return std::move(buffer);
+ });
+ });
+}
+
+static future<> test_basic_read_write(const std::string& device_path) {
+ return async([&] {
+ block_device dev = open_block_device(device_path).get0();
+ constexpr size_t buff_size = 16*KB;
+ auto buffer = allocate_random_aligned_buffer(buff_size).get0();
+ auto check_buffer = allocate_random_aligned_buffer(buff_size).get0();
+
+ // Write and read
+ assert(dev.write(0, buffer.get(), buff_size).get0() == buff_size);
+ assert(dev.read(0, check_buffer.get_write(), buff_size).get0() == buff_size);
+ assert(std::memcmp(buffer.get(), check_buffer.get(), buff_size) == 0);
+
+ // Data have to remain after closing
+ dev.close().get0();
+ dev = open_block_device(device_path).get0();
+ check_buffer = allocate_random_aligned_buffer(buff_size).get0(); // Make sure the buffer is written
+ assert(dev.read(0, check_buffer.get_write(), buff_size).get0() == buff_size);
+ assert(std::memcmp(buffer.get(), check_buffer.get(), buff_size) == 0);
+
+ dev.close().get0();
+ });
+}
+
+static future<> test_parallel_read_write(const std::string& device_path) {
+ return async([&] {
+ block_device dev = open_block_device(device_path).get0();
+ constexpr size_t buff_size = 16*MB;
+ auto buffer = allocate_random_aligned_buffer(buff_size).get0();
+
+ // Write
+ static_assert(buff_size % alignment == 0);
+ parallel_for_each(boost::irange<off_t>(0, buff_size / alignment), [&](off_t block_no) {
+ off_t offset = block_no * alignment;
+ return dev.write(offset, buffer.get() + offset, alignment).then([](size_t written) {
+ assert(written == alignment);
+ });
+ }).get0();
+
+ // Read
+ static_assert(buff_size % alignment == 0);
+ parallel_for_each(boost::irange<off_t>(0, buff_size / alignment), [&](off_t block_no) {
+ return async([&dev, &buffer, block_no] {
+ off_t offset = block_no * alignment;
+ auto check_buffer = allocate_random_aligned_buffer(alignment).get0();
+ assert(dev.read(offset, check_buffer.get_write(), alignment).get0() == alignment);
+ assert(std::memcmp(buffer.get() + offset, check_buffer.get(), alignment) == 0);
+ });
+ }).get0();
+
+ dev.close().get0();
+ });
+}
+
+static future<> test_simultaneous_parallel_read_and_write(const std::string& device_path) {
+ return async([&] {
+ block_device dev = open_block_device(device_path).get0();
+ constexpr size_t buff_size = 16*MB;
+ auto buffer = allocate_random_aligned_buffer(buff_size).get0();
+ assert(dev.write(0, buffer.get(), buff_size).get0() == buff_size);
+
+ static_assert(buff_size % alignment == 0);
+ size_t blocks_num = buff_size / alignment;
+ enum Kind { WRITE, READ };
+ std::vector<Kind> block_kind(blocks_num);
+ std::default_random_engine random_engine(testing::local_random_engine());
+ std::uniform_int_distribution<> choose_write(0, 1);
+ for (Kind& kind : block_kind) {
+ kind = (choose_write(random_engine) ? WRITE : READ);
+ }
+
+ // Perform simultaneous reads and writes
+ auto new_buffer = allocate_random_aligned_buffer(buff_size).get0();
+ auto write_fut = parallel_for_each(boost::irange<off_t>(0, blocks_num), [&](off_t block_no) {
+ if (block_kind[block_no] != WRITE) {
+ return now();
+ }
+
+ off_t offset = block_no * alignment;
+ return dev.write(offset, new_buffer.get() + offset, alignment).then([](size_t written) {
+ assert(written == alignment);
+ });
+ });
+ auto read_fut = parallel_for_each(boost::irange<off_t>(0, blocks_num), [&](off_t block_no) {
+ if (block_kind[block_no] != READ) {
+ return now();
+ }
+
+ return async([&dev, &buffer, block_no] {
+ off_t offset = block_no * alignment;
+ auto check_buffer = allocate_random_aligned_buffer(alignment).get0();
+ assert(dev.read(offset, check_buffer.get_write(), alignment).get0() == alignment);
+ assert(std::memcmp(buffer.get() + offset, check_buffer.get(), alignment) == 0);
+ });
+ });
+
+ when_all_succeed(std::move(write_fut), std::move(read_fut)).get0();
+
+ // Check that writes were made in the correct places
+ parallel_for_each(boost::irange<off_t>(0, blocks_num), [&](off_t block_no) {
+ return async([&dev, &buffer, &new_buffer, &block_kind, block_no] {
+ off_t offset = block_no * alignment;
+ auto check_buffer = allocate_random_aligned_buffer(alignment).get0();
+ assert(dev.read(offset, check_buffer.get_write(), alignment).get0() == alignment);
+ auto& orig_buff = (block_kind[block_no] == WRITE ? new_buffer : buffer);
+ assert(std::memcmp(orig_buff.get() + offset, check_buffer.get(), alignment) == 0);
+ });
+ }).get0();
+
+ dev.close().get0();
+ });
+}
+
+static future<> prepare_file(const std::string& file_path) {
+ return async([&] {
+ // Create device file if it does not exist
+ file dev = open_file_dma(file_path, open_flags::rw | open_flags::create).get0();
+
+ auto st = dev.stat().get0();
+ if (S_ISREG(st.st_mode) and st.st_size < min_device_size) {
+ dev.truncate(min_device_size).get0();
+ }
+
+ dev.close().get0();
+ });
+}
+
+int main(int argc, char** argv) {
+ app_template app;
+ app.add_options()
+ ("help", "produce this help message")
+ ("dev", boost::program_options::value<std::string>(),
+ "optional path to device file to test block_device on");
+ return app.run(argc, argv, [&app] {
+ return async([&] {
+ auto& args = app.configuration();
+ std::optional<temporary_file> tmp_device_file;
+ std::string device_path = [&]() -> std::string {
+ if (args.count("dev")) {
+ return args["dev"].as<std::string>();
+ }
+
+ tmp_device_file.emplace("/tmp/block_device_test_file");
+ return tmp_device_file->path();
+ }();
+
+ assert(not device_path.empty());
+ prepare_file(device_path).get0();
+ test_basic_read_write(device_path).get0();
+ test_parallel_read_write(device_path).get0();
+ test_simultaneous_parallel_read_and_write(device_path).get0();
+ });
+ });
+}
diff --git a/tests/unit/CMakeLists.txt b/tests/unit/CMakeLists.txt
index f2c5187f..21e564fb 100644
--- a/tests/unit/CMakeLists.txt
+++ b/tests/unit/CMakeLists.txt
@@ -362,6 +362,9 @@ seastar_add_test (rpc
rpc_test.cc)

if (Seastar_EXPERIMENTAL_FS)
+ seastar_add_app_test (fs_block_device
+ SOURCES fs_block_device_test.cc
+ LIBRARIES seastar_testing)
seastar_add_test (fs_seastarfs
SOURCES fs_seastarfs_test.cc)
endif()
--
2.26.1
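
The last scenario in the test above (random, simultaneous reads and writes, followed by a check that writes landed in the right places) can be modelled without Seastar. The following is only a sequential sketch of the final verification step, under stated assumptions: the namespace `sketch`, the function `check_selective_overwrite`, and the in-memory `device` vector are hypothetical stand-ins and are not part of the patch; `block_size` plays the role of `alignment`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

namespace sketch { // hypothetical namespace, not part of the patch

constexpr std::size_t block_size = 4096; // plays the role of `alignment` in the test

// Overwrites a random subset of blocks and then verifies that every block
// holds the new data if it was selected for writing and the old data otherwise.
inline bool check_selective_overwrite(std::size_t blocks, unsigned seed) {
    assert(blocks > 0);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> byte(0, 255);
    std::bernoulli_distribution choose_write(0.5);

    std::vector<std::uint8_t> old_data(blocks * block_size);
    std::vector<std::uint8_t> new_data(blocks * block_size);
    for (auto& b : old_data) { b = static_cast<std::uint8_t>(byte(rng)); }
    for (auto& b : new_data) { b = static_cast<std::uint8_t>(byte(rng)); }

    // In-memory stand-in for the block device, initially holding old_data.
    std::vector<std::uint8_t> device = old_data;

    // Decide per block whether it is written (true) or only read (false).
    std::vector<bool> is_write(blocks);
    for (std::size_t i = 0; i < blocks; ++i) { is_write[i] = choose_write(rng); }

    // "Write" phase: only the selected blocks are replaced.
    for (std::size_t i = 0; i < blocks; ++i) {
        if (is_write[i]) {
            std::copy_n(new_data.data() + i * block_size, block_size,
                        device.data() + i * block_size);
        }
    }

    // Verification phase: compare each block against the buffer it should match.
    for (std::size_t i = 0; i < blocks; ++i) {
        const auto& expected = is_write[i] ? new_data : old_data;
        const std::size_t off = i * block_size;
        if (!std::equal(expected.data() + off, expected.data() + off + block_size,
                        device.data() + off)) {
            return false;
        }
    }
    return true;
}

} // namespace sketch
```

Calling `sketch::check_selective_overwrite(16, 42)` exercises 16 blocks with a fixed seed. The real test differs in that reads and writes are issued concurrently against an actual device or file.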

Krzysztof Małysa

<varqox@gmail.com>
Apr 20, 2020, 8:02:24 AM4/20/20
to seastar-dev@googlegroups.com, Krzysztof Małysa, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com
- units.hh: basic units
- cluster.hh: cluster_id unit and operations on it (converting cluster
ids to offsets)
- inode.hh: inode unit and operations on it (extracting shard_no from
inode and allocating new inode)
- bitwise.hh: bitwise operations
- range.hh: range abstraction

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
src/fs/bitwise.hh | 125 ++++++++++++++++++++++++++++++++++++++++++++++
src/fs/cluster.hh | 42 ++++++++++++++++
src/fs/inode.hh | 80 +++++++++++++++++++++++++++++
src/fs/range.hh | 61 ++++++++++++++++++++++
src/fs/units.hh | 40 +++++++++++++++
CMakeLists.txt | 5 ++
6 files changed, 353 insertions(+)
create mode 100644 src/fs/bitwise.hh
create mode 100644 src/fs/cluster.hh
create mode 100644 src/fs/inode.hh
create mode 100644 src/fs/range.hh
create mode 100644 src/fs/units.hh

diff --git a/src/fs/bitwise.hh b/src/fs/bitwise.hh
new file mode 100644
index 00000000..e53c1919
--- /dev/null
+++ b/src/fs/bitwise.hh
@@ -0,0 +1,125 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include <cassert>
+#include <type_traits>
+
+namespace seastar::fs {
+
+template<class T, std::enable_if_t<std::is_unsigned_v<T>, int> = 0>
+constexpr inline bool is_power_of_2(T x) noexcept {
+ return (x > 0 and (x & (x - 1)) == 0);
+}
+
+static_assert(not is_power_of_2(0u));
+static_assert(is_power_of_2(1u));
+static_assert(is_power_of_2(2u));
+static_assert(not is_power_of_2(3u));
+static_assert(is_power_of_2(4u));
+static_assert(not is_power_of_2(5u));
+static_assert(not is_power_of_2(6u));
+static_assert(not is_power_of_2(7u));
+static_assert(is_power_of_2(8u));
+
+template<class T, class U, std::enable_if_t<std::is_unsigned_v<T>, int> = 0, std::enable_if_t<std::is_unsigned_v<U>, int> = 0>
+constexpr inline T div_by_power_of_2(T a, U b) noexcept {
+ assert(is_power_of_2(b));
+ return (a >> __builtin_ctzll(b)); // should be 2 CPU cycles after inlining on modern x86_64
+}
+
+static_assert(div_by_power_of_2(13u, 1u) == 13);
+static_assert(div_by_power_of_2(12u, 4u) == 3);
+static_assert(div_by_power_of_2(42u, 32u) == 1);
+
+template<class T, class U, std::enable_if_t<std::is_unsigned_v<T>, int> = 0, std::enable_if_t<std::is_unsigned_v<U>, int> = 0>
+constexpr inline T mod_by_power_of_2(T a, U b) noexcept {
+ assert(is_power_of_2(b));
+ return (a & (b - 1));
+}
+
+static_assert(mod_by_power_of_2(13u, 1u) == 0);
+static_assert(mod_by_power_of_2(42u, 32u) == 10);
+
+template<class T, class U, std::enable_if_t<std::is_unsigned_v<T>, int> = 0, std::enable_if_t<std::is_unsigned_v<U>, int> = 0>
+constexpr inline T mul_by_power_of_2(T a, U b) noexcept {
+ assert(is_power_of_2(b));
+ return (a << __builtin_ctzll(b)); // should be 2 CPU cycles after inlining on modern x86_64
+}
+
+static_assert(mul_by_power_of_2(3u, 1u) == 3);
+static_assert(mul_by_power_of_2(3u, 4u) == 12);
+
+template<class T, class U, std::enable_if_t<std::is_unsigned_v<T>, int> = 0, std::enable_if_t<std::is_unsigned_v<U>, int> = 0>
+constexpr inline T round_down_to_multiple_of_power_of_2(T a, U b) noexcept {
+ return a - mod_by_power_of_2(a, b);
+}
+
+static_assert(round_down_to_multiple_of_power_of_2(0u, 1u) == 0);
+static_assert(round_down_to_multiple_of_power_of_2(1u, 1u) == 1);
+static_assert(round_down_to_multiple_of_power_of_2(19u, 1u) == 19);
+
+static_assert(round_down_to_multiple_of_power_of_2(0u, 2u) == 0);
+static_assert(round_down_to_multiple_of_power_of_2(1u, 2u) == 0);
+static_assert(round_down_to_multiple_of_power_of_2(2u, 2u) == 2);
+static_assert(round_down_to_multiple_of_power_of_2(3u, 2u) == 2);
+static_assert(round_down_to_multiple_of_power_of_2(4u, 2u) == 4);
+static_assert(round_down_to_multiple_of_power_of_2(5u, 2u) == 4);
+
+static_assert(round_down_to_multiple_of_power_of_2(31u, 16u) == 16);
+static_assert(round_down_to_multiple_of_power_of_2(32u, 16u) == 32);
+static_assert(round_down_to_multiple_of_power_of_2(33u, 16u) == 32);
+static_assert(round_down_to_multiple_of_power_of_2(37u, 16u) == 32);
+static_assert(round_down_to_multiple_of_power_of_2(39u, 16u) == 32);
+static_assert(round_down_to_multiple_of_power_of_2(45u, 16u) == 32);
+static_assert(round_down_to_multiple_of_power_of_2(47u, 16u) == 32);
+static_assert(round_down_to_multiple_of_power_of_2(48u, 16u) == 48);
+static_assert(round_down_to_multiple_of_power_of_2(49u, 16u) == 48);
+
+template<class T, class U, std::enable_if_t<std::is_unsigned_v<T>, int> = 0, std::enable_if_t<std::is_unsigned_v<U>, int> = 0>
+constexpr inline T round_up_to_multiple_of_power_of_2(T a, U b) noexcept {
+ auto mod = mod_by_power_of_2(a, b);
+ return (mod == 0 ? a : a - mod + b);
+}
+
+static_assert(round_up_to_multiple_of_power_of_2(0u, 1u) == 0);
+static_assert(round_up_to_multiple_of_power_of_2(1u, 1u) == 1);
+static_assert(round_up_to_multiple_of_power_of_2(19u, 1u) == 19);
+
+static_assert(round_up_to_multiple_of_power_of_2(0u, 2u) == 0);
+static_assert(round_up_to_multiple_of_power_of_2(1u, 2u) == 2);
+static_assert(round_up_to_multiple_of_power_of_2(2u, 2u) == 2);
+static_assert(round_up_to_multiple_of_power_of_2(3u, 2u) == 4);
+static_assert(round_up_to_multiple_of_power_of_2(4u, 2u) == 4);
+static_assert(round_up_to_multiple_of_power_of_2(5u, 2u) == 6);
+
+static_assert(round_up_to_multiple_of_power_of_2(31u, 16u) == 32);
+static_assert(round_up_to_multiple_of_power_of_2(32u, 16u) == 32);
+static_assert(round_up_to_multiple_of_power_of_2(33u, 16u) == 48);
+static_assert(round_up_to_multiple_of_power_of_2(37u, 16u) == 48);
+static_assert(round_up_to_multiple_of_power_of_2(39u, 16u) == 48);
+static_assert(round_up_to_multiple_of_power_of_2(45u, 16u) == 48);
+static_assert(round_up_to_multiple_of_power_of_2(47u, 16u) == 48);
+static_assert(round_up_to_multiple_of_power_of_2(48u, 16u) == 48);
+static_assert(round_up_to_multiple_of_power_of_2(49u, 16u) == 64);
+
+} // namespace seastar::fs
diff --git a/src/fs/cluster.hh b/src/fs/cluster.hh
new file mode 100644
index 00000000..a35ce323
--- /dev/null
+++ b/src/fs/cluster.hh
@@ -0,0 +1,42 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/bitwise.hh"
+#include "fs/units.hh"
+
+namespace seastar::fs {
+
+using cluster_id_t = uint64_t;
+using cluster_range = range<cluster_id_t>;
+
+inline cluster_id_t offset_to_cluster_id(disk_offset_t offset, unit_size_t cluster_size) noexcept {
+ assert(is_power_of_2(cluster_size));
+ return div_by_power_of_2(offset, cluster_size);
+}
+
+inline disk_offset_t cluster_id_to_offset(cluster_id_t cluster_id, unit_size_t cluster_size) noexcept {
+ assert(is_power_of_2(cluster_size));
+ return mul_by_power_of_2(cluster_id, cluster_size);
+}
+
+} // namespace seastar::fs
diff --git a/src/fs/inode.hh b/src/fs/inode.hh
new file mode 100644
index 00000000..aabc8d00
--- /dev/null
+++ b/src/fs/inode.hh
@@ -0,0 +1,80 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/bitwise.hh"
+#include "fs/units.hh"
+
+#include <cstdint>
+#include <optional>
+
+namespace seastar::fs {
+
+// The last log2(fs_shards_pool_size) bits of the inode number contain the id of the shard that owns the inode
+using inode_t = uint64_t;
+
+// Obtains shard id of the shard owning @p inode.
+//@p fs_shards_pool_size is the number of file system shards rounded up to a power of 2
+inline fs_shard_id_t inode_to_shard_no(inode_t inode, fs_shard_id_t fs_shards_pool_size) noexcept {
+ assert(is_power_of_2(fs_shards_pool_size));
+ return mod_by_power_of_2(inode, fs_shards_pool_size);
+}
+
+// Returns inode belonging to the shard owning @p shard_previous_inode that is next after @p shard_previous_inode
+// (i.e. the lowest inode greater than @p shard_previous_inode belonging to the same shard)
+//@p fs_shards_pool_size is the number of file system shards rounded up to a power of 2
+inline inode_t shard_next_inode(inode_t shard_previous_inode, fs_shard_id_t fs_shards_pool_size) noexcept {
+ return shard_previous_inode + fs_shards_pool_size;
+}
+
+// Returns first inode (lowest by value) belonging to the shard @p fs_shard_id
+inline inode_t shard_first_inode(fs_shard_id_t fs_shard_id) noexcept {
+ return fs_shard_id;
+}
+
+class shard_inode_allocator {
+ fs_shard_id_t _fs_shards_pool_size;
+ fs_shard_id_t _fs_shard_id;
+ std::optional<inode_t> _latest_allocated_inode;
+
+public:
+ shard_inode_allocator(fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id, std::optional<inode_t> latest_allocated_inode = std::nullopt)
+ : _fs_shards_pool_size(fs_shards_pool_size)
+ , _fs_shard_id(fs_shard_id)
+ , _latest_allocated_inode(latest_allocated_inode) {}
+
+ inode_t alloc() noexcept {
+ if (not _latest_allocated_inode) {
+ _latest_allocated_inode = shard_first_inode(_fs_shard_id);
+ return *_latest_allocated_inode;
+ }
+
+ _latest_allocated_inode = shard_next_inode(*_latest_allocated_inode, _fs_shards_pool_size);
+ return *_latest_allocated_inode;
+ }
+
+ void reset(std::optional<inode_t> latest_allocated_inode = std::nullopt) noexcept {
+ _latest_allocated_inode = latest_allocated_inode;
+ }
+};
+
+} // namespace seastar::fs
diff --git a/src/fs/range.hh b/src/fs/range.hh
new file mode 100644
index 00000000..ef0c6756
--- /dev/null
+++ b/src/fs/range.hh
@@ -0,0 +1,61 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include <algorithm>
+
+namespace seastar::fs {
+
+template <class T>
+struct range {
+ T beg;
+ T end; // exclusive
+
+ constexpr bool is_empty() const noexcept { return beg >= end; }
+
+ constexpr T size() const noexcept { return end - beg; }
+};
+
+template <class T>
+range(T beg, T end) -> range<T>;
+
+template <class T>
+inline bool operator==(range<T> a, range<T> b) noexcept {
+ return (a.beg == b.beg and a.end == b.end);
+}
+
+template <class T>
+inline bool operator!=(range<T> a, range<T> b) noexcept {
+ return not (a == b);
+}
+
+template <class T>
+inline range<T> intersection(range<T> a, range<T> b) noexcept {
+ return {std::max(a.beg, b.beg), std::min(a.end, b.end)};
+}
+
+template <class T>
+inline bool are_intersecting(range<T> a, range<T> b) noexcept {
+ return (std::max(a.beg, b.beg) < std::min(a.end, b.end));
+}
+
+} // namespace seastar::fs
diff --git a/src/fs/units.hh b/src/fs/units.hh
new file mode 100644
index 00000000..1fc6754b
--- /dev/null
+++ b/src/fs/units.hh
@@ -0,0 +1,40 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/range.hh"
+
+#include <cstdint>
+
+namespace seastar::fs {
+
+using unit_size_t = uint32_t;
+
+using disk_offset_t = uint64_t;
+using disk_range = range<disk_offset_t>;
+
+using file_offset_t = uint64_t;
+using file_range = range<file_offset_t>;
+
+using fs_shard_id_t = uint32_t;
+
+} // namespace seastar::fs
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 39d11ad8..8ad08c7a 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -660,7 +660,12 @@ if (Seastar_EXPERIMENTAL_FS)
include/seastar/fs/block_device.hh
include/seastar/fs/file.hh
include/seastar/fs/temporary_file.hh
+ src/fs/bitwise.hh
+ src/fs/cluster.hh
src/fs/file.cc
+ src/fs/inode.hh
+ src/fs/range.hh
+ src/fs/units.hh
)
endif()

--
2.26.1
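
For readers following the series, the arithmetic behind inode.hh and cluster.hh can be tried outside Seastar. This is a minimal sketch rather than the patch code: the namespace `sketch`, the helper `log2_of_power_of_2`, and the sample values (a pool of 4 shards, 1 MiB clusters) are assumptions made for illustration. The other function names mirror the patch, and the shift is computed with a portable loop instead of `__builtin_ctzll`.

```cpp
#include <cstdint>

namespace sketch { // hypothetical namespace, not part of the patch

constexpr bool is_power_of_2(std::uint64_t x) noexcept {
    return x > 0 && (x & (x - 1)) == 0;
}

// Portable log2 for powers of two (the patch uses __builtin_ctzll instead).
constexpr unsigned log2_of_power_of_2(std::uint64_t x) noexcept {
    unsigned shift = 0;
    while (x > 1) {
        x >>= 1;
        ++shift;
    }
    return shift;
}

// The low log2(pool_size) bits of an inode number identify the owning shard.
constexpr std::uint64_t inode_to_shard_no(std::uint64_t inode, std::uint64_t pool_size) noexcept {
    return inode & (pool_size - 1);
}

// The next inode owned by the same shard is one pool size further.
constexpr std::uint64_t shard_next_inode(std::uint64_t prev, std::uint64_t pool_size) noexcept {
    return prev + pool_size;
}

// Disk offset -> cluster id: divide by the (power-of-2) cluster size.
constexpr std::uint64_t offset_to_cluster_id(std::uint64_t offset, std::uint64_t cluster_size) noexcept {
    return offset >> log2_of_power_of_2(cluster_size);
}

// Cluster id -> disk offset of the cluster start: multiply by the cluster size.
constexpr std::uint64_t cluster_id_to_offset(std::uint64_t id, std::uint64_t cluster_size) noexcept {
    return id << log2_of_power_of_2(cluster_size);
}

} // namespace sketch

// Shard 3 in a pool of 4 owns inodes 3, 7, 11, ...
static_assert(sketch::is_power_of_2(4), "pool size must be a power of 2");
static_assert(sketch::inode_to_shard_no(3, 4) == 3, "inode 3 belongs to shard 3");
static_assert(sketch::shard_next_inode(3, 4) == 7, "next inode of shard 3 is 7");
static_assert(sketch::inode_to_shard_no(7, 4) == 3, "inode 7 belongs to shard 3");
```

With 1 MiB clusters, an offset of 3 MiB + 5 bytes falls into cluster 3, and cluster 3 starts at offset 3 MiB; both conversions reduce to a single shift once the cluster size is known to be a power of 2.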

Krzysztof Małysa

<varqox@gmail.com>
Apr 20, 2020, 8:02:25 AM4/20/20
to seastar-dev@googlegroups.com, Wojciech Mitros, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com
From: Wojciech Mitros <wmi...@protonmail.com>

Disk space is divided into fixed-size segments called clusters. Each shard of
the filesystem is assigned a set of clusters. The cluster allocator allocates
and frees clusters from that set.

Signed-off-by: Wojciech Mitros <wmi...@protonmail.com>
---
src/fs/cluster_allocator.hh | 50 ++++++++++++++++++++++++++++++++++
src/fs/cluster_allocator.cc | 54 +++++++++++++++++++++++++++++++++++++
CMakeLists.txt | 2 ++
3 files changed, 106 insertions(+)
create mode 100644 src/fs/cluster_allocator.hh
create mode 100644 src/fs/cluster_allocator.cc

diff --git a/src/fs/cluster_allocator.hh b/src/fs/cluster_allocator.hh
new file mode 100644
index 00000000..ef4f30b9
--- /dev/null
+++ b/src/fs/cluster_allocator.hh
@@ -0,0 +1,50 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/cluster.hh"
+
+#include <deque>
+#include <optional>
+#include <unordered_set>
+
+namespace seastar {
+
+namespace fs {
+
+class cluster_allocator {
+ std::unordered_set<cluster_id_t> _allocated_clusters;
+ std::deque<cluster_id_t> _free_clusters;
+
+public:
+ explicit cluster_allocator(std::unordered_set<cluster_id_t> allocated_clusters, std::deque<cluster_id_t> free_clusters);
+
+ // Tries to allocate a cluster
+ std::optional<cluster_id_t> alloc();
+
+ // @p cluster_id has to be allocated using alloc()
+ void free(cluster_id_t cluster_id);
+};
+
+}
+
+}
diff --git a/src/fs/cluster_allocator.cc b/src/fs/cluster_allocator.cc
new file mode 100644
index 00000000..c436c7ba
--- /dev/null
+++ b/src/fs/cluster_allocator.cc
@@ -0,0 +1,54 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#include "fs/cluster.hh"
+#include "fs/cluster_allocator.hh"
+
+#include <cassert>
+#include <optional>
+
+namespace seastar {
+
+namespace fs {
+
+cluster_allocator::cluster_allocator(std::unordered_set<cluster_id_t> allocated_clusters, std::deque<cluster_id_t> free_clusters)
+ : _allocated_clusters(std::move(allocated_clusters)), _free_clusters(std::move(free_clusters)) {}
+
+std::optional<cluster_id_t> cluster_allocator::alloc() {
+ if (_free_clusters.empty()) {
+ return std::nullopt;
+ }
+
+ cluster_id_t cluster_id = _free_clusters.front();
+ _free_clusters.pop_front();
+ _allocated_clusters.insert(cluster_id);
+ return cluster_id;
+}
+
+void cluster_allocator::free(cluster_id_t cluster_id) {
+ assert(_allocated_clusters.count(cluster_id) == 1);
+ _free_clusters.emplace_back(cluster_id);
+ _allocated_clusters.erase(cluster_id);
+}
+
+}
+
+}
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 8ad08c7a..891201a3 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -662,6 +662,8 @@ if (Seastar_EXPERIMENTAL_FS)
include/seastar/fs/temporary_file.hh
src/fs/bitwise.hh
src/fs/cluster.hh
+ src/fs/cluster_allocator.cc
+ src/fs/cluster_allocator.hh
src/fs/file.cc
src/fs/inode.hh
src/fs/range.hh
--
2.26.1
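
The allocation order that the allocator above produces (free clusters are handed out from the front of the deque, and freed clusters are appended to the back) can be illustrated with a standalone stand-in. This is a sketch under assumptions, not the Seastar code: the namespace `sketch` and the function `demo_fifo_reuse` are hypothetical, and the class is a simplified copy of the one in the patch so that the example compiles on its own.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <optional>
#include <unordered_set>
#include <utility>

namespace sketch { // hypothetical namespace, not part of the patch

using cluster_id_t = std::uint64_t;

// Simplified stand-in for fs::cluster_allocator from the patch.
class cluster_allocator {
    std::unordered_set<cluster_id_t> _allocated;
    std::deque<cluster_id_t> _free;

public:
    cluster_allocator(std::unordered_set<cluster_id_t> allocated, std::deque<cluster_id_t> free_clusters)
        : _allocated(std::move(allocated)), _free(std::move(free_clusters)) {}

    // Takes the cluster at the front of the free queue, if there is one.
    std::optional<cluster_id_t> alloc() {
        if (_free.empty()) {
            return std::nullopt;
        }
        cluster_id_t id = _free.front();
        _free.pop_front();
        _allocated.insert(id);
        return id;
    }

    // Returns a previously allocated cluster to the back of the free queue.
    void free(cluster_id_t id) {
        assert(_allocated.count(id) == 1);
        _allocated.erase(id);
        _free.push_back(id);
    }
};

// Walks through exhaustion and FIFO reuse; returns true if the order is as expected.
inline bool demo_fifo_reuse() {
    cluster_allocator ca({}, {7, 9});
    auto a = ca.alloc();         // front of the queue: 7
    auto b = ca.alloc();         // then 9
    auto none = ca.alloc();      // nothing left
    ca.free(9);
    ca.free(7);                  // free queue is now 9, 7
    auto c = ca.alloc();         // 9 comes back first
    auto d = ca.alloc();         // then 7
    return a && *a == 7 && b && *b == 9 && !none && c && *c == 9 && d && *d == 7;
}

} // namespace sketch
```

A consequence of the deque-based design is that a freed cluster is reused only after all clusters that were already free, which is what the `test_small` case in the follow-up test patch relies on.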

Krzysztof Małysa

<varqox@gmail.com>
Apr 20, 2020, 8:02:26 AM4/20/20
to seastar-dev@googlegroups.com, Wojciech Mitros, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com
From: Wojciech Mitros <wmi...@protonmail.com>

Add tests checking that the cluster allocator works correctly in ordinary
cases and in corner cases (e.g. trying to alloc with no free clusters).

Signed-off-by: Wojciech Mitros <wmi...@protonmail.com>
---
tests/unit/fs_cluster_allocator_test.cc | 115 ++++++++++++++++++++++++
tests/unit/CMakeLists.txt | 3 +
2 files changed, 118 insertions(+)
create mode 100644 tests/unit/fs_cluster_allocator_test.cc

diff --git a/tests/unit/fs_cluster_allocator_test.cc b/tests/unit/fs_cluster_allocator_test.cc
new file mode 100644
index 00000000..3650254e
--- /dev/null
+++ b/tests/unit/fs_cluster_allocator_test.cc
@@ -0,0 +1,115 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#define BOOST_TEST_MODULE fs
+
+#include "fs/cluster_allocator.hh"
+
+#include <boost/test/included/unit_test.hpp>
+#include <deque>
+#include <seastar/core/units.hh>
+#include <unordered_set>
+
+using namespace seastar;
+
+BOOST_AUTO_TEST_CASE(test_cluster_0) {
+ fs::cluster_allocator ca({}, {0});
+ BOOST_REQUIRE_EQUAL(ca.alloc().value(), 0);
+ BOOST_REQUIRE(ca.alloc() == std::nullopt);
+ BOOST_REQUIRE(ca.alloc() == std::nullopt);
+ ca.free(0);
+ BOOST_REQUIRE_EQUAL(ca.alloc().value(), 0);
+ BOOST_REQUIRE(ca.alloc() == std::nullopt);
+ BOOST_REQUIRE(ca.alloc() == std::nullopt);
+}
+
+BOOST_AUTO_TEST_CASE(test_empty) {
+ fs::cluster_allocator empty_ca{{}, {}};
+ BOOST_REQUIRE(empty_ca.alloc() == std::nullopt);
+}
+
+BOOST_AUTO_TEST_CASE(test_small) {
+ std::deque<fs::cluster_id_t> deq{1, 5, 3, 4, 2};
+ fs::cluster_allocator small_ca({}, deq);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[0]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[1]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[2]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[3]);
+
+ small_ca.free(deq[2]);
+ small_ca.free(deq[1]);
+ small_ca.free(deq[3]);
+ small_ca.free(deq[0]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[4]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[2]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[1]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[3]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[0]);
+ BOOST_REQUIRE(small_ca.alloc() == std::nullopt);
+
+ small_ca.free(deq[2]);
+ small_ca.free(deq[4]);
+ small_ca.free(deq[3]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[2]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[4]);
+ small_ca.free(deq[2]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[3]);
+ small_ca.free(deq[4]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[2]);
+}
+
+BOOST_AUTO_TEST_CASE(test_max) {
+ constexpr fs::cluster_id_t clusters_per_shard = 1024;
+ std::deque<fs::cluster_id_t> deq;
+ for (fs::cluster_id_t i = 0; i < clusters_per_shard; i++) {
+ deq.emplace_back(i);
+ }
+ fs::cluster_allocator ordinary_ca({}, deq);
+ for (fs::cluster_id_t i = 0; i < clusters_per_shard; i++) {
+ BOOST_REQUIRE_EQUAL(ordinary_ca.alloc().value(), i);
+ }
+ BOOST_REQUIRE(ordinary_ca.alloc() == std::nullopt);
+ for (fs::cluster_id_t i = 0; i < clusters_per_shard; i++) {
+ ordinary_ca.free(i);
+ }
+}
+
+BOOST_AUTO_TEST_CASE(test_pseudo_rand) {
+ std::unordered_set<fs::cluster_id_t> uset;
+ std::deque<fs::cluster_id_t> deq;
+ fs::cluster_id_t elem = 215;
+ while (elem != 806) {
+ deq.emplace_back(elem);
+ elem = (elem * 215) % 1021;
+ }
+ elem = 1;
+ while (elem != 1020) {
+ uset.insert(elem);
+ elem = (elem * 19) % 1021;
+ }
+ fs::cluster_allocator random_ca(uset, deq);
+ elem = 215;
+ while (elem != 1) {
+ BOOST_REQUIRE_EQUAL(random_ca.alloc().value(), elem);
+ random_ca.free(1021-elem);
+ elem = (elem * 215) % 1021;
+ }
+}
diff --git a/tests/unit/CMakeLists.txt b/tests/unit/CMakeLists.txt
index 21e564fb..b2669e0a 100644
--- a/tests/unit/CMakeLists.txt
+++ b/tests/unit/CMakeLists.txt
@@ -365,6 +365,9 @@ if (Seastar_EXPERIMENTAL_FS)
seastar_add_app_test (fs_block_device
SOURCES fs_block_device_test.cc
LIBRARIES seastar_testing)
+ seastar_add_test (fs_cluster_allocator
+ KIND BOOST
+ SOURCES fs_cluster_allocator_test.cc)

Krzysztof Małysa

<varqox@gmail.com>
Apr 20, 2020, 8:02:27 AM4/20/20
to seastar-dev@googlegroups.com, Michał Niciejewski, sarna@scylladb.com, ankezy@gmail.com, wmitros@protonmail.com
From: Michał Niciejewski <qup...@gmail.com>

Corner case tests:
- simple valid tests for reading and writing bootstrap record
- valid and invalid number of shards (range
[1, bootstrap_record::max_shards_nb] is valid)
- invalid crc in read record
- invalid magic number in read record
- invalid information about filesystem shards:
* id of the first metadata log cluster not in available cluster range
* invalid cluster range
* overlapping available cluster ranges for two different shards
* invalid alignment
* invalid cluster size
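
The shard-information cases above boil down to a few range rules. What follows is a minimal standalone sketch of those rules, not the code from this series: range_sketch, shard_sketch and validate_shards_sketch are hypothetical names, plain uint64_t stands in for cluster_id_t, and the returned strings are illustrative only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins for the cluster range and per-shard info types.
struct range_sketch {
    uint64_t beg;  // first cluster of the range
    uint64_t end;  // one past the last cluster (half-open range)
};

struct shard_sketch {
    uint64_t metadata_cluster;  // first metadata log cluster
    range_sketch available;     // clusters owned by the shard
};

// Returns a description of the first violated rule, or std::nullopt if all hold.
// Takes the vector by value because it is sorted for the overlap check.
inline std::optional<std::string> validate_shards_sketch(std::vector<shard_sketch> shards) {
    for (const shard_sketch& s : shards) {
        if (s.available.beg >= s.available.end) {
            return std::string("invalid cluster range");
        }
        if (s.available.beg == 0) {
            return std::string("range contains cluster 0");
        }
        if (s.metadata_cluster < s.available.beg || s.metadata_cluster >= s.available.end) {
            return std::string("metadata cluster outside range");
        }
    }
    // After sorting by range start, only neighbouring ranges can overlap.
    std::sort(shards.begin(), shards.end(), [] (const shard_sketch& l, const shard_sketch& r) {
        return l.available.beg < r.available.beg;
    });
    for (std::size_t i = 1; i < shards.size(); i++) {
        if (shards[i - 1].available.end > shards[i].available.beg) {
            return std::string("cluster ranges overlap");
        }
    }
    return std::nullopt;
}

// Usage: the same inputs as the corner cases listed in this message.
inline void validate_shards_examples() {
    assert(!validate_shards_sketch({{1, {1, 3}}, {3, {3, 5}}}).has_value());
    assert(validate_shards_sketch({{1, {2, 3}}}).value() == "metadata cluster outside range");
    assert(validate_shards_sketch({{3, {4, 2}}}).value() == "invalid cluster range");
    assert(validate_shards_sketch({{1, {0, 5}}}).value() == "range contains cluster 0");
    assert(validate_shards_sketch({{1, {1, 3}}, {2, {2, 4}}}).value() == "cluster ranges overlap");
}
```

Sorting a copy keeps the overlap check at O(n log n) and leaves the caller's shard order untouched.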

Signed-off-by: Michał Niciejewski <qup...@gmail.com>
---
tests/unit/fs_mock_block_device.hh | 55 ++++
tests/unit/fs_bootstrap_record_test.cc | 414 +++++++++++++++++++++++++
tests/unit/fs_mock_block_device.cc | 50 +++
tests/unit/CMakeLists.txt | 4 +
4 files changed, 523 insertions(+)
create mode 100644 tests/unit/fs_mock_block_device.hh
create mode 100644 tests/unit/fs_bootstrap_record_test.cc
create mode 100644 tests/unit/fs_mock_block_device.cc

diff --git a/tests/unit/fs_mock_block_device.hh b/tests/unit/fs_mock_block_device.hh
new file mode 100644
index 00000000..08da1491
--- /dev/null
+++ b/tests/unit/fs_mock_block_device.hh
@@ -0,0 +1,55 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB Ltd.
+ */
+
+#pragma once
+
+#include <cstring>
+#include <seastar/fs/block_device.hh>
+
+namespace seastar::fs {
+
+class mock_block_device_impl : public block_device_impl {
+public:
+ using buf_type = basic_sstring<uint8_t, size_t, 32, false>;
+ buf_type buf;
+ ~mock_block_device_impl() override = default;
+
+ struct write_operation {
+ uint64_t disk_offset;
+ temporary_buffer<uint8_t> data;
+ };
+
+ std::vector<write_operation> writes;
+
+ future<size_t> write(uint64_t pos, const void* buffer, size_t len, const io_priority_class&) override;
+
+ future<size_t> read(uint64_t pos, void* buffer, size_t len, const io_priority_class&) noexcept override;
+
+ future<> flush() noexcept override {
+ return make_ready_future<>();
+ }
+
+ future<> close() noexcept override {
+ return make_ready_future<>();
+ }
+};
+
+} // seastar::fs
diff --git a/tests/unit/fs_bootstrap_record_test.cc b/tests/unit/fs_bootstrap_record_test.cc
new file mode 100644
index 00000000..9994f5ee
--- /dev/null
+++ b/tests/unit/fs_bootstrap_record_test.cc
@@ -0,0 +1,414 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#include "fs/bootstrap_record.hh"
+#include "fs/cluster.hh"
+#include "fs/crc.hh"
+#include "fs_mock_block_device.hh"
+
+#include <boost/crc.hpp>
+#include <cstring>
+#include <seastar/core/print.hh>
+#include <seastar/fs/block_device.hh>
+#include <seastar/testing/test_case.hh>
+#include <seastar/testing/test_runner.hh>
+#include <seastar/testing/thread_test_case.hh>
+
+using namespace seastar;
+using namespace seastar::fs;
+
+namespace {
+
+inline std::vector<bootstrap_record::shard_info> prepare_valid_shards_info(uint32_t size) {
+ std::vector<bootstrap_record::shard_info> ret(size);
+ cluster_id_t curr = 1;
+ for (bootstrap_record::shard_info& info : ret) {
+ info.available_clusters = {curr, curr + 1};
+ info.metadata_cluster = curr;
+ curr++;
+ }
+ return ret;
+}
+
+inline void repair_crc32(shared_ptr<mock_block_device_impl> dev_impl) noexcept {
+ mock_block_device_impl::buf_type& buff = dev_impl.get()->buf;
+ constexpr size_t crc_pos = offsetof(bootstrap_record_disk, crc);
+ const uint32_t crc_new = crc32(buff.data(), crc_pos);
+ std::memcpy(buff.data() + crc_pos, &crc_new, sizeof(crc_new));
+}
+
+inline void change_byte_at_offset(shared_ptr<mock_block_device_impl> dev_impl, size_t offset) noexcept {
+ dev_impl.get()->buf[offset] ^= 1;
+}
+
+template<typename T>
+inline void place_at_offset(shared_ptr<mock_block_device_impl> dev_impl, size_t offset, T value) noexcept {
+ std::memcpy(dev_impl.get()->buf.data() + offset, &value, sizeof(value));
+}
+
+template<>
+inline void place_at_offset(shared_ptr<mock_block_device_impl> dev_impl, size_t offset,
+ std::vector<bootstrap_record::shard_info> shards_info) noexcept {
+ bootstrap_record::shard_info shards_info_disk[bootstrap_record::max_shards_nb];
+ std::memset(shards_info_disk, 0, sizeof(shards_info_disk));
+ std::copy(shards_info.begin(), shards_info.end(), shards_info_disk);
+
+ std::memcpy(dev_impl.get()->buf.data() + offset, shards_info_disk, sizeof(shards_info_disk));
+}
+
+inline bool check_exception_message(const invalid_bootstrap_record& ex, const sstring& message) {
+ return sstring(ex.what()).find(message) != sstring::npos;
+}
+
+const bootstrap_record default_write_record(1, bootstrap_record::min_alignment * 4,
+ bootstrap_record::min_alignment * 8, 1, {{6, {6, 9}}, {9, {9, 12}}, {12, {12, 15}}});
+
+}
+
+
+
+BOOST_TEST_DONT_PRINT_LOG_VALUE(bootstrap_record)
+
+SEASTAR_THREAD_TEST_CASE(valid_basic_test) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ const bootstrap_record write_record = default_write_record;
+
+ write_record.write_to_disk(dev).get();
+ const bootstrap_record read_record = bootstrap_record::read_from_disk(dev).get0();
+ BOOST_REQUIRE_EQUAL(write_record, read_record);
+}
+
+SEASTAR_THREAD_TEST_CASE(valid_max_shards_nb_test) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ bootstrap_record write_record = default_write_record;
+ write_record.shards_info = prepare_valid_shards_info(bootstrap_record::max_shards_nb);
+
+ write_record.write_to_disk(dev).get();
+ const bootstrap_record read_record = bootstrap_record::read_from_disk(dev).get0();
+ BOOST_REQUIRE_EQUAL(write_record, read_record);
+}
+
+SEASTAR_THREAD_TEST_CASE(valid_one_shard_test) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ bootstrap_record write_record = default_write_record;
+ write_record.shards_info = prepare_valid_shards_info(1);
+
+ write_record.write_to_disk(dev).get();
+ const bootstrap_record read_record = bootstrap_record::read_from_disk(dev).get0();
+ BOOST_REQUIRE_EQUAL(write_record, read_record);
+}
+
+
+
+SEASTAR_THREAD_TEST_CASE(invalid_crc_read) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ const bootstrap_record write_record = default_write_record;
+
+ constexpr size_t crc_offset = offsetof(bootstrap_record_disk, crc);
+
+ write_record.write_to_disk(dev).get();
+ change_byte_at_offset(dev_impl, crc_offset);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Invalid CRC");
+ });
+}
+
+SEASTAR_THREAD_TEST_CASE(invalid_magic_read) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ const bootstrap_record write_record = default_write_record;
+
+ constexpr size_t magic_offset = offsetof(bootstrap_record_disk, magic);
+
+ write_record.write_to_disk(dev).get();
+ change_byte_at_offset(dev_impl, magic_offset);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Invalid magic number");
+ });
+}
+
+SEASTAR_THREAD_TEST_CASE(invalid_shards_info_read) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ const bootstrap_record write_record = default_write_record;
+
+ constexpr size_t shards_nb_offset = offsetof(bootstrap_record_disk, shards_nb);
+ constexpr size_t shards_info_offset = offsetof(bootstrap_record_disk, shards_info);
+
+ // shards_nb > max_shards_nb
+ write_record.write_to_disk(dev).get();
+ place_at_offset(dev_impl, shards_nb_offset, bootstrap_record::max_shards_nb + 1);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, fmt::format("Shards number should be smaller or equal to {}", bootstrap_record::max_shards_nb));
+ });
+
+ // shards_nb == 0
+ write_record.write_to_disk(dev).get();
+ place_at_offset(dev_impl, shards_nb_offset, 0);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Shards number should be greater than 0");
+ });
+
+ std::vector<bootstrap_record::shard_info> shards_info;
+
+ // metadata_cluster not in available_clusters range
+ write_record.write_to_disk(dev).get();
+ shards_info = {{1, {2, 3}}};
+ place_at_offset(dev_impl, shards_nb_offset, shards_info.size());
+ place_at_offset(dev_impl, shards_info_offset, shards_info);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster with metadata should be inside available cluster range");
+ });
+
+ write_record.write_to_disk(dev).get();
+ shards_info = {{3, {2, 3}}};
+ place_at_offset(dev_impl, shards_nb_offset, shards_info.size());
+ place_at_offset(dev_impl, shards_info_offset, shards_info);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster with metadata should be inside available cluster range");
+ });
+
+ // available_clusters.beg > available_clusters.end
+ write_record.write_to_disk(dev).get();
+ shards_info = {{3, {4, 2}}};
+ place_at_offset(dev_impl, shards_nb_offset, shards_info.size());
+ place_at_offset(dev_impl, shards_info_offset, shards_info);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Invalid cluster range");
+ });
+
+ // available_clusters.beg == available_clusters.end
+ write_record.write_to_disk(dev).get();
+ shards_info = {{2, {2, 2}}};
+ place_at_offset(dev_impl, shards_nb_offset, shards_info.size());
+ place_at_offset(dev_impl, shards_info_offset, shards_info);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Invalid cluster range");
+ });
+
+ // available_clusters contains cluster 0
+ write_record.write_to_disk(dev).get();
+ shards_info = {{1, {0, 5}}};
+ place_at_offset(dev_impl, shards_nb_offset, shards_info.size());
+ place_at_offset(dev_impl, shards_info_offset, shards_info);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Range of available clusters should not contain cluster 0");
+ });
+
+ // available_clusters overlap
+ write_record.write_to_disk(dev).get();
+ shards_info = {{1, {1, 3}}, {2, {2, 4}}};
+ place_at_offset(dev_impl, shards_nb_offset, shards_info.size());
+ place_at_offset(dev_impl, shards_info_offset, shards_info);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster ranges should not overlap");
+ });
+}
+
+SEASTAR_THREAD_TEST_CASE(invalid_alignment_read) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ const bootstrap_record write_record = default_write_record;
+
+ constexpr size_t alignment_offset = offsetof(bootstrap_record_disk, alignment);
+
+ // alignment not power of 2
+ write_record.write_to_disk(dev).get();
+ place_at_offset(dev_impl, alignment_offset, bootstrap_record::min_alignment + 1);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Alignment should be a power of 2");
+ });
+
+ // alignment smaller than bootstrap_record::min_alignment
+ write_record.write_to_disk(dev).get();
+ place_at_offset(dev_impl, alignment_offset, bootstrap_record::min_alignment / 2);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, fmt::format("Alignment should be greater or equal to {}", bootstrap_record::min_alignment));
+ });
+}
+
+SEASTAR_THREAD_TEST_CASE(invalid_cluster_size_read) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ const bootstrap_record write_record = default_write_record;
+
+ constexpr size_t cluster_size_offset = offsetof(bootstrap_record_disk, cluster_size);
+
+ // cluster_size not divisible by alignment
+ write_record.write_to_disk(dev).get();
+ place_at_offset(dev_impl, cluster_size_offset, write_record.alignment / 2);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster size should be divisible by alignment");
+ });
+
+ // cluster_size not power of 2
+ write_record.write_to_disk(dev).get();
+ place_at_offset(dev_impl, cluster_size_offset, write_record.alignment * 3);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster size should be a power of 2");
+ });
+}
+
+
+
+SEASTAR_THREAD_TEST_CASE(invalid_shards_info_write) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ bootstrap_record write_record = default_write_record;
+
+ // shards_nb > max_shards_nb
+ write_record = default_write_record;
+ write_record.shards_info = prepare_valid_shards_info(bootstrap_record::max_shards_nb + 1);
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, fmt::format("Shards number should be smaller or equal to {}", bootstrap_record::max_shards_nb));
+ });
+
+ // shards_nb == 0
+ write_record = default_write_record;
+ write_record.shards_info.clear();
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Shards number should be greater than 0");
+ });
+
+ // metadata_cluster not in available_clusters range
+ write_record = default_write_record;
+ write_record.shards_info = {{1, {2, 3}}};
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster with metadata should be inside available cluster range");
+ });
+
+ write_record = default_write_record;
+ write_record.shards_info = {{3, {2, 3}}};
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster with metadata should be inside available cluster range");
+ });
+
+ // available_clusters.beg > available_clusters.end
+ write_record = default_write_record;
+ write_record.shards_info = {{3, {4, 2}}};
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Invalid cluster range");
+ });
+
+ // available_clusters.beg == available_clusters.end
+ write_record = default_write_record;
+ write_record.shards_info = {{2, {2, 2}}};
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Invalid cluster range");
+ });
+
+ // available_clusters contains cluster 0
+ write_record = default_write_record;
+ write_record.shards_info = {{1, {0, 5}}};
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Range of available clusters should not contain cluster 0");
+ });
+
+ // available_clusters overlap
+ write_record = default_write_record;
+ write_record.shards_info = {{1, {1, 3}}, {2, {2, 4}}};
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster ranges should not overlap");
+ });
+}
+
+SEASTAR_THREAD_TEST_CASE(invalid_alignment_write) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ bootstrap_record write_record = default_write_record;
+
+ // alignment not power of 2
+ write_record = default_write_record;
+ write_record.alignment = bootstrap_record::min_alignment + 1;
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Alignment should be a power of 2");
+ });
+
+ // alignment smaller than bootstrap_record::min_alignment
+ write_record = default_write_record;
+ write_record.alignment = bootstrap_record::min_alignment / 2;
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, fmt::format("Alignment should be greater or equal to {}", bootstrap_record::min_alignment));
+ });
+}
+
+SEASTAR_THREAD_TEST_CASE(invalid_cluster_size_write) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ bootstrap_record write_record = default_write_record;
+
+ // cluster_size not divisible by alignment
+ write_record = default_write_record;
+ write_record.cluster_size = write_record.alignment / 2;
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster size should be divisible by alignment");
+ });
+
+ // cluster_size not power of 2
+ write_record = default_write_record;
+ write_record.cluster_size = write_record.alignment * 3;
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster size should be a power of 2");
+ });
+}
diff --git a/tests/unit/fs_mock_block_device.cc b/tests/unit/fs_mock_block_device.cc
new file mode 100644
index 00000000..6f83587e
--- /dev/null
+++ b/tests/unit/fs_mock_block_device.cc
@@ -0,0 +1,50 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB Ltd.
+ */
+
+#include "fs_mock_block_device.hh"
+
+namespace seastar::fs {
+
+namespace {
+logger mlogger("fs_mock_block_device");
+} // namespace
+
+future<size_t> mock_block_device_impl::write(uint64_t pos, const void* buffer, size_t len, const io_priority_class&) {
+ mlogger.debug("write({}, ..., {})", pos, len);
+ writes.emplace_back(write_operation {
+ pos,
+ temporary_buffer<uint8_t>(static_cast<const uint8_t*>(buffer), len)
+ });
+ if (buf.size() < pos + len)
+ buf.resize(pos + len);
+ std::memcpy(buf.data() + pos, buffer, len);
+ return make_ready_future<size_t>(len);
+}
+
+future<size_t> mock_block_device_impl::read(uint64_t pos, void* buffer, size_t len, const io_priority_class&) noexcept {
+ mlogger.debug("read({}, ..., {})", pos, len);
+ if (buf.size() < pos + len)
+ buf.resize(pos + len);
+ std::memcpy(buffer, buf.c_str() + pos, len);
+ return make_ready_future<size_t>(len);
+}
+
+} // seastar::fs
diff --git a/tests/unit/CMakeLists.txt b/tests/unit/CMakeLists.txt
index b2669e0a..f9591046 100644
--- a/tests/unit/CMakeLists.txt
+++ b/tests/unit/CMakeLists.txt
@@ -365,6 +365,10 @@ if (Seastar_EXPERIMENTAL_FS)
seastar_add_app_test (fs_block_device
SOURCES fs_block_device_test.cc
LIBRARIES seastar_testing)
+ seastar_add_test (fs_bootstrap_record
+ SOURCES
+ fs_bootstrap_record_test.cc
+ fs_mock_block_device.cc)
seastar_add_test (fs_cluster_allocator
KIND BOOST
SOURCES fs_cluster_allocator_test.cc)
--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>
Apr 20, 2020, 8:02:27 AM4/20/20
to seastar-dev@googlegroups.com, Michał Niciejewski, sarna@scylladb.com, ankezy@gmail.com, wmitros@protonmail.com
From: Michał Niciejewski <qup...@gmail.com>

The bootstrap record serves the same role as the superblock in other
filesystems.
It contains the basic information needed to properly bootstrap the
filesystem:
- filesystem version
- alignment used for data writes
- cluster size
- inode number of the root directory
- information needed to bootstrap every shard of the filesystem:
* id of the first metadata log cluster
* range of available clusters for data and metadata
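
As a rough illustration of the constraints on the alignment and cluster size fields, here is a minimal standalone sketch. The helper names are hypothetical and plain uint64_t is used instead of unit_size_t. The 4096 default mirrors bootstrap_record::min_alignment from this patch, and the rounding formula is the one used to size the on-disk record.

```cpp
#include <cassert>
#include <cstdint>

// True when x is a non-zero power of two.
constexpr bool is_pow2_sketch(uint64_t x) {
    return x != 0 && (x & (x - 1)) == 0;
}

// Rounds size up to a multiple of alignment; requires size > 0 and alignment > 0.
constexpr uint64_t round_up_sketch(uint64_t size, uint64_t alignment) {
    return (1 + (size - 1) / alignment) * alignment;
}

// Both values must be powers of two, the alignment must reach the minimum,
// and the cluster size must be a multiple of the alignment.
constexpr bool geometry_ok_sketch(uint64_t alignment, uint64_t cluster_size,
        uint64_t min_alignment = 4096) {
    return is_pow2_sketch(alignment)
        && alignment >= min_alignment
        && is_pow2_sketch(cluster_size)
        && cluster_size % alignment == 0;
}

static_assert(round_up_sketch(1, 4096) == 4096);
static_assert(round_up_sketch(4096, 4096) == 4096);
static_assert(round_up_sketch(4097, 4096) == 8192);

// Usage: values in the style of the unit tests for this record.
inline void geometry_examples() {
    assert(geometry_ok_sketch(4096 * 4, 4096 * 8));   // valid geometry
    assert(!geometry_ok_sketch(4096 + 1, 4096 * 8));  // alignment not a power of 2
    assert(!geometry_ok_sketch(4096 / 2, 4096 * 8));  // alignment below the minimum
    assert(!geometry_ok_sketch(4096 * 4, 4096 * 2));  // cluster size not divisible by alignment
    assert(!geometry_ok_sketch(4096 * 4, 4096 * 12)); // cluster size not a power of 2
}
```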

Signed-off-by: Michał Niciejewski <qup...@gmail.com>
---
src/fs/bootstrap_record.hh | 98 ++++++++++++++++++
src/fs/crc.hh | 34 ++++++
src/fs/bootstrap_record.cc | 206 +++++++++++++++++++++++++++++++++++++
CMakeLists.txt | 3 +
4 files changed, 341 insertions(+)
create mode 100644 src/fs/bootstrap_record.hh
create mode 100644 src/fs/crc.hh
create mode 100644 src/fs/bootstrap_record.cc

diff --git a/src/fs/bootstrap_record.hh b/src/fs/bootstrap_record.hh
new file mode 100644
index 00000000..ee15295a
--- /dev/null
+++ b/src/fs/bootstrap_record.hh
@@ -0,0 +1,98 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/cluster.hh"
+#include "fs/inode.hh"
+#include "seastar/fs/block_device.hh"
+
+#include <stdexcept>
+
+namespace seastar::fs {
+
+class invalid_bootstrap_record : public std::runtime_error {
+public:
+ explicit invalid_bootstrap_record(const std::string& msg) : std::runtime_error(msg) {}
+ explicit invalid_bootstrap_record(const char* msg) : std::runtime_error(msg) {}
+};
+
+/// In-memory version of the record describing characteristics of the file system (~superblock).
+class bootstrap_record {
+public:
+ static constexpr uint64_t magic_number = 0x5343594c4c414653; // SCYLLAFS
+ static constexpr uint32_t max_shards_nb = 500;
+ static constexpr unit_size_t min_alignment = 4096;
+
+ struct shard_info {
+ cluster_id_t metadata_cluster; /// cluster id of the first metadata log cluster
+ cluster_range available_clusters; /// range of clusters available to this shard for data and metadata
+ };
+
+ uint64_t version; /// file system version
+ unit_size_t alignment; /// write alignment in bytes
+ unit_size_t cluster_size; /// cluster size in bytes
+ inode_t root_directory; /// root dir inode number
+ std::vector<shard_info> shards_info; /// basic information about each file system shard
+
+ bootstrap_record() = default;
+ bootstrap_record(uint64_t version, unit_size_t alignment, unit_size_t cluster_size, inode_t root_directory,
+ std::vector<shard_info> shards_info)
+ : version(version), alignment(alignment), cluster_size(cluster_size) , root_directory(root_directory)
+ , shards_info(std::move(shards_info)) {}
+
+ /// number of file system shards
+ uint32_t shards_nb() const noexcept {
+ return shards_info.size();
+ }
+
+ static future<bootstrap_record> read_from_disk(block_device& device);
+ future<> write_to_disk(block_device& device) const;
+
+ friend bool operator==(const bootstrap_record&, const bootstrap_record&) noexcept;
+ friend bool operator!=(const bootstrap_record&, const bootstrap_record&) noexcept;
+};
+
+inline bool operator==(const bootstrap_record::shard_info& lhs, const bootstrap_record::shard_info& rhs) noexcept {
+ return lhs.metadata_cluster == rhs.metadata_cluster and lhs.available_clusters == rhs.available_clusters;
+}
+
+inline bool operator!=(const bootstrap_record::shard_info& lhs, const bootstrap_record::shard_info& rhs) noexcept {
+ return !(lhs == rhs);
+}
+
+inline bool operator!=(const bootstrap_record& lhs, const bootstrap_record& rhs) noexcept {
+ return !(lhs == rhs);
+}
+
+/// On-disk version of the record describing characteristics of the file system (~superblock).
+struct bootstrap_record_disk {
+ uint64_t magic;
+ uint64_t version;
+ unit_size_t alignment;
+ unit_size_t cluster_size;
+ inode_t root_directory;
+ uint32_t shards_nb;
+ bootstrap_record::shard_info shards_info[bootstrap_record::max_shards_nb];
+ uint32_t crc;
+};
+
+}
diff --git a/src/fs/crc.hh b/src/fs/crc.hh
new file mode 100644
index 00000000..da557323
--- /dev/null
+++ b/src/fs/crc.hh
@@ -0,0 +1,34 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include <boost/crc.hpp>
+
+namespace seastar::fs {
+
+inline uint32_t crc32(const void* buff, size_t len) noexcept {
+ boost::crc_32_type result;
+ result.process_bytes(buff, len);
+ return result.checksum();
+}
+
+}
diff --git a/src/fs/bootstrap_record.cc b/src/fs/bootstrap_record.cc
new file mode 100644
index 00000000..a342efb6
--- /dev/null
+++ b/src/fs/bootstrap_record.cc
@@ -0,0 +1,206 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#include "fs/bootstrap_record.hh"
+#include "fs/crc.hh"
+#include "seastar/core/print.hh"
+#include "seastar/core/units.hh"
+
+namespace seastar::fs {
+
+namespace {
+
+constexpr unit_size_t write_alignment = 4 * KB;
+constexpr disk_offset_t bootstrap_record_offset = 0;
+
+constexpr size_t aligned_bootstrap_record_size =
+ (1 + (sizeof(bootstrap_record_disk) - 1) / write_alignment) * write_alignment;
+constexpr size_t crc_offset = offsetof(bootstrap_record_disk, crc);
+
+inline std::optional<invalid_bootstrap_record> check_alignment(unit_size_t alignment) {
+ if (!is_power_of_2(alignment)) {
+ return invalid_bootstrap_record(fmt::format("Alignment should be a power of 2, read alignment '{}'",
+ alignment));
+ }
+ if (alignment < bootstrap_record::min_alignment) {
+ return invalid_bootstrap_record(fmt::format("Alignment should be greater or equal to {}, read alignment '{}'",
+ bootstrap_record::min_alignment, alignment));
+ }
+ return std::nullopt;
+}
+
+inline std::optional<invalid_bootstrap_record> check_cluster_size(unit_size_t cluster_size, unit_size_t alignment) {
+ if (!is_power_of_2(cluster_size)) {
+ return invalid_bootstrap_record(fmt::format("Cluster size should be a power of 2, read cluster size '{}'", cluster_size));
+ }
+ if (cluster_size % alignment != 0) {
+ return invalid_bootstrap_record(fmt::format(
+ "Cluster size should be divisible by alignment, read alignment '{}', read cluster size '{}'",
+ alignment, cluster_size));
+ }
+ return std::nullopt;
+}
+
+inline std::optional<invalid_bootstrap_record> check_shards_number(uint32_t shards_nb) {
+ if (shards_nb == 0) {
+ return invalid_bootstrap_record(fmt::format("Shards number should be greater than 0, read shards number '{}'",
+ shards_nb));
+ }
+ if (shards_nb > bootstrap_record::max_shards_nb) {
+ return invalid_bootstrap_record(fmt::format(
+ "Shards number should be smaller or equal to {}, read shards number '{}'",
+ bootstrap_record::max_shards_nb, shards_nb));
+ }
+ return std::nullopt;
+}
+
+std::optional<invalid_bootstrap_record> check_shards_info(std::vector<bootstrap_record::shard_info> shards_info) {
+ // check 1 <= beg <= metadata_cluster < end
+ for (const bootstrap_record::shard_info& info : shards_info) {
+ if (info.available_clusters.beg >= info.available_clusters.end) {
+ return invalid_bootstrap_record(fmt::format("Invalid cluster range, read cluster range [{}, {})",
+ info.available_clusters.beg, info.available_clusters.end));
+ }
+ if (info.available_clusters.beg == 0) {
+ return invalid_bootstrap_record(fmt::format(
+ "Range of available clusters should not contain cluster 0, read cluster range [{}, {})",
+ info.available_clusters.beg, info.available_clusters.end));
+ }
+ if (info.available_clusters.beg > info.metadata_cluster ||
+ info.available_clusters.end <= info.metadata_cluster) {
+ return invalid_bootstrap_record(fmt::format(
+ "Cluster with metadata should be inside available cluster range, read cluster range [{}, {}), read metadata cluster '{}'",
+ info.available_clusters.beg, info.available_clusters.end, info.metadata_cluster));
+ }
+ }
+
+ // check that ranges don't overlap
+ sort(shards_info.begin(), shards_info.end(),
+ [] (const bootstrap_record::shard_info& left,
+ const bootstrap_record::shard_info& right) {
+ return left.available_clusters.beg < right.available_clusters.beg;
+ });
+ for (size_t i = 1; i < shards_info.size(); i++) {
+ if (shards_info[i - 1].available_clusters.end > shards_info[i].available_clusters.beg) {
+ return invalid_bootstrap_record(fmt::format(
+ "Cluster ranges should not overlap, overlapping ranges [{}, {}), [{}, {})",
+ shards_info[i - 1].available_clusters.beg, shards_info[i - 1].available_clusters.end,
+ shards_info[i].available_clusters.beg, shards_info[i].available_clusters.end));
+ }
+ }
+ return std::nullopt;
+}
+
+}
+
+future<bootstrap_record> bootstrap_record::read_from_disk(block_device& device) {
+ auto bootstrap_record_buff = temporary_buffer<char>::aligned(write_alignment, aligned_bootstrap_record_size);
+ return device.read(bootstrap_record_offset, bootstrap_record_buff.get_write(), aligned_bootstrap_record_size)
+ .then([bootstrap_record_buff = std::move(bootstrap_record_buff)] (size_t ret) {
+ if (ret != aligned_bootstrap_record_size) {
+ return make_exception_future<bootstrap_record>(
+ invalid_bootstrap_record(fmt::format(
+ "Error while reading bootstrap record block, {} bytes read instead of {}",
+ ret, aligned_bootstrap_record_size)));
+ }
+
+ bootstrap_record_disk bootstrap_record_disk;
+ std::memcpy(&bootstrap_record_disk, bootstrap_record_buff.get(), sizeof(bootstrap_record_disk));
+
+ const uint32_t crc_calc = crc32(bootstrap_record_buff.get(), crc_offset);
+ if (crc_calc != bootstrap_record_disk.crc) {
+ return make_exception_future<bootstrap_record>(
+ invalid_bootstrap_record(fmt::format("Invalid CRC, expected crc '{}', read crc '{}'",
+ crc_calc, bootstrap_record_disk.crc)));
+ }
+ if (magic_number != bootstrap_record_disk.magic) {
+ return make_exception_future<bootstrap_record>(
+ invalid_bootstrap_record(fmt::format("Invalid magic number, expected magic '{}', read magic '{}'",
+ magic_number, bootstrap_record_disk.magic)));
+ }
+ if (std::optional<invalid_bootstrap_record> ret_check;
+ (ret_check = check_alignment(bootstrap_record_disk.alignment)) ||
+ (ret_check = check_cluster_size(bootstrap_record_disk.cluster_size, bootstrap_record_disk.alignment)) ||
+ (ret_check = check_shards_number(bootstrap_record_disk.shards_nb))) {
+ return make_exception_future<bootstrap_record>(ret_check.value());
+ }
+
+ const std::vector<shard_info> tmp_shards_info(bootstrap_record_disk.shards_info,
+ bootstrap_record_disk.shards_info + bootstrap_record_disk.shards_nb);
+
+ if (std::optional<invalid_bootstrap_record> ret_check;
+ (ret_check = check_shards_info(tmp_shards_info))) {
+ return make_exception_future<bootstrap_record>(ret_check.value());
+ }
+
+ bootstrap_record bootstrap_record_mem(bootstrap_record_disk.version,
+ bootstrap_record_disk.alignment,
+ bootstrap_record_disk.cluster_size,
+ bootstrap_record_disk.root_directory,
+ std::move(tmp_shards_info));
+
+ return make_ready_future<bootstrap_record>(std::move(bootstrap_record_mem));
+ });
+}
+
+future<> bootstrap_record::write_to_disk(block_device& device) const {
+ // initial checks
+ if (std::optional<invalid_bootstrap_record> ret_check;
+ (ret_check = check_alignment(alignment)) ||
+ (ret_check = check_cluster_size(cluster_size, alignment)) ||
+ (ret_check = check_shards_number(shards_nb())) ||
+ (ret_check = check_shards_info(shards_info))) {
+ return make_exception_future<>(ret_check.value());
+ }
+
+ auto bootstrap_record_buff = temporary_buffer<char>::aligned(write_alignment, aligned_bootstrap_record_size);
+ std::memset(bootstrap_record_buff.get_write(), 0, aligned_bootstrap_record_size);
+ bootstrap_record_disk* bootstrap_record_disk = (struct bootstrap_record_disk*)bootstrap_record_buff.get_write();
+
+ // prepare bootstrap_record_disk records
+ bootstrap_record_disk->magic = bootstrap_record::magic_number;
+ bootstrap_record_disk->version = version;
+ bootstrap_record_disk->alignment = alignment;
+ bootstrap_record_disk->cluster_size = cluster_size;
+ bootstrap_record_disk->root_directory = root_directory;
+ bootstrap_record_disk->shards_nb = shards_nb();
+ std::copy(shards_info.begin(), shards_info.end(), bootstrap_record_disk->shards_info);
+ bootstrap_record_disk->crc = crc32(bootstrap_record_disk, crc_offset);
+
+ return device.write(bootstrap_record_offset, bootstrap_record_buff.get(), aligned_bootstrap_record_size)
+ .then([bootstrap_record_buff = std::move(bootstrap_record_buff)] (size_t ret) {
+ if (ret != aligned_bootstrap_record_size) {
+ return make_exception_future<>(
+ invalid_bootstrap_record(fmt::format(
+ "Error while writing bootstrap record block to disk, {} bytes written instead of {}",
+ ret, aligned_bootstrap_record_size)));
+ }
+ return make_ready_future<>();
+ });
+}
+
+bool operator==(const bootstrap_record& lhs, const bootstrap_record& rhs) noexcept {
+ return lhs.version == rhs.version and lhs.alignment == rhs.alignment and
+ lhs.cluster_size == rhs.cluster_size and lhs.root_directory == rhs.root_directory and
+ lhs.shards_info == rhs.shards_info;
+}
+
+}
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 891201a3..ca994d42 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -661,9 +661,12 @@ if (Seastar_EXPERIMENTAL_FS)
include/seastar/fs/file.hh
include/seastar/fs/temporary_file.hh
src/fs/bitwise.hh
+ src/fs/bootstrap_record.cc
+ src/fs/bootstrap_record.hh
src/fs/cluster.hh
src/fs/cluster_allocator.cc
src/fs/cluster_allocator.hh
+ src/fs/crc.hh

Krzysztof Małysa

<varqox@gmail.com>
Apr 20, 2020, 8:02:28 AM4/20/20
to seastar-dev@googlegroups.com, Krzysztof Małysa, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com
overloaded is a useful wrapper that simplifies the use of std::visit over
std::variant. It allows matching variants by type using lambdas, in
a way similar to pattern matching in functional languages.
For details see: https://en.cppreference.com/w/cpp/utility/variant/visit#Example
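
As a minimal illustration of the idiom (not part of the patch), the sketch below visits a std::variant with one lambda per alternative. The `entry` alias and the `describe()` function are hypothetical names chosen for this example; only the two `overloaded` lines are taken from the header added here.

```cpp
#include <cassert>
#include <string>
#include <variant>

// The same two lines as in include/seastar/fs/overloaded.hh (requires C++17).
template<class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
template<class... Ts> overloaded(Ts...) -> overloaded<Ts...>;

// Hypothetical variant type, used only for this illustration.
using entry = std::variant<int, std::string>;

// Picks the lambda whose parameter type matches the active alternative.
inline std::string describe(const entry& e) {
    return std::visit(overloaded {
        [](int v) { return "int: " + std::to_string(v); },
        [](const std::string& s) { return "string: " + s; }
    }, e);
}
```

If an alternative has no matching lambda, the std::visit call fails to compile, which is the main practical advantage over a chain of std::holds_alternative checks.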

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
include/seastar/fs/overloaded.hh | 26 ++++++++++++++++++++++++++
CMakeLists.txt | 1 +
2 files changed, 27 insertions(+)
create mode 100644 include/seastar/fs/overloaded.hh

diff --git a/include/seastar/fs/overloaded.hh b/include/seastar/fs/overloaded.hh
new file mode 100644
index 00000000..2a205ba3
--- /dev/null
+++ b/include/seastar/fs/overloaded.hh
@@ -0,0 +1,26 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+// Taken from: https://en.cppreference.com/w/cpp/utility/variant/visit
+template<class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
+template<class... Ts> overloaded(Ts...) -> overloaded<Ts...>;
diff --git a/CMakeLists.txt b/CMakeLists.txt
index ca994d42..be3f921f 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -659,6 +659,7 @@ if (Seastar_EXPERIMENTAL_FS)
# SeastarFS source files
include/seastar/fs/block_device.hh
include/seastar/fs/file.hh
+ include/seastar/fs/overloaded.hh
include/seastar/fs/temporary_file.hh
src/fs/bitwise.hh
src/fs/bootstrap_record.cc
--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>
Apr 20, 2020, 8:02:29 AM4/20/20
to seastar-dev@googlegroups.com, Krzysztof Małysa, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com
path.hh provides the extract_last_component() function, which extracts
the last component of the provided path and removes it from that path.
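
A short usage sketch follows. The function body is copied from the patch so that the example compiles on its own; `split_dir_and_name()` is a hypothetical helper introduced only for illustration and is not part of path.hh.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Copied from src/fs/path.hh in this patch so the sketch is self-contained.
inline std::string extract_last_component(std::string& path) {
    auto beg = path.find_last_of('/');
    if (beg == path.npos) {
        std::string res = std::move(path);
        path = {};
        return res;
    }

    auto res = path.substr(beg + 1);
    path.resize(beg + 1);
    return res;
}

// Hypothetical helper: splits a path into (directory part, entry name).
// The directory part keeps its trailing '/', exactly as the function leaves it.
inline std::pair<std::string, std::string> split_dir_and_name(std::string path) {
    std::string name = extract_last_component(path);
    return {std::move(path), std::move(name)};
}
```

Note that a path ending with '/' yields an empty name and is left unchanged, so a caller that wants to walk up a directory tree has to strip the trailing slash itself.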

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
src/fs/path.hh | 42 ++++++++++++++++++
tests/unit/fs_path_test.cc | 90 ++++++++++++++++++++++++++++++++++++++
CMakeLists.txt | 1 +
tests/unit/CMakeLists.txt | 3 ++
4 files changed, 136 insertions(+)
create mode 100644 src/fs/path.hh
create mode 100644 tests/unit/fs_path_test.cc

diff --git a/src/fs/path.hh b/src/fs/path.hh
new file mode 100644
index 00000000..9da4c517
--- /dev/null
+++ b/src/fs/path.hh
@@ -0,0 +1,42 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include <string>
+
+namespace seastar::fs {
+
+// Extracts the last component in @p path. WARNING: The last component is empty iff @p path is empty or ends with '/'
+inline std::string extract_last_component(std::string& path) {
+ auto beg = path.find_last_of('/');
+ if (beg == path.npos) {
+ std::string res = std::move(path);
+ path = {};
+ return res;
+ }
+
+ auto res = path.substr(beg + 1);
+ path.resize(beg + 1);
+ return res;
+}
+
+} // namespace seastar::fs
diff --git a/tests/unit/fs_path_test.cc b/tests/unit/fs_path_test.cc
new file mode 100644
index 00000000..956e64d7
--- /dev/null
+++ b/tests/unit/fs_path_test.cc
@@ -0,0 +1,90 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#include "fs/path.hh"
+
+#define BOOST_TEST_MODULE fs
+#include <boost/test/included/unit_test.hpp>
+
+using namespace seastar::fs;
+
+BOOST_AUTO_TEST_CASE(last_component_simple) {
+ {
+ std::string str = "";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), "");
+ BOOST_REQUIRE_EQUAL(str, "");
+ }
+ {
+ std::string str = "/";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), "");
+ BOOST_REQUIRE_EQUAL(str, "/");
+ }
+ {
+ std::string str = "/foo/bar.txt";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), "bar.txt");
+ BOOST_REQUIRE_EQUAL(str, "/foo/");
+ }
+ {
+ std::string str = "/foo/.bar";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), ".bar");
+ BOOST_REQUIRE_EQUAL(str, "/foo/");
+ }
+ {
+ std::string str = "/foo/bar/";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), "");
+ BOOST_REQUIRE_EQUAL(str, "/foo/bar/");
+ }
+ {
+ std::string str = "/foo/.";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), ".");
+ BOOST_REQUIRE_EQUAL(str, "/foo/");
+ }
+ {
+ std::string str = "/foo/..";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), "..");
+ BOOST_REQUIRE_EQUAL(str, "/foo/");
+ }
+ {
+ std::string str = "bar.txt";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), "bar.txt");
+ BOOST_REQUIRE_EQUAL(str, "");
+ }
+ {
+ std::string str = ".bar";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), ".bar");
+ BOOST_REQUIRE_EQUAL(str, "");
+ }
+ {
+ std::string str = ".";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), ".");
+ BOOST_REQUIRE_EQUAL(str, "");
+ }
+ {
+ std::string str = "..";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), "..");
+ BOOST_REQUIRE_EQUAL(str, "");
+ }
+ {
+ std::string str = "//host";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), "host");
+ BOOST_REQUIRE_EQUAL(str, "//");
+ }
+}
diff --git a/CMakeLists.txt b/CMakeLists.txt
index be3f921f..fb8fe32c 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -670,6 +670,7 @@ if (Seastar_EXPERIMENTAL_FS)
src/fs/crc.hh
src/fs/file.cc
src/fs/inode.hh
+ src/fs/path.hh
src/fs/range.hh
src/fs/units.hh
)
diff --git a/tests/unit/CMakeLists.txt b/tests/unit/CMakeLists.txt
index f9591046..07551b0b 100644
--- a/tests/unit/CMakeLists.txt
+++ b/tests/unit/CMakeLists.txt
@@ -372,6 +372,9 @@ if (Seastar_EXPERIMENTAL_FS)
seastar_add_test (fs_cluster_allocator
KIND BOOST
SOURCES fs_cluster_allocator_test.cc)
+ seastar_add_test (fs_path
+ KIND BOOST
+ SOURCES fs_path_test.cc)

Krzysztof Małysa

<varqox@gmail.com>
Apr 20, 2020, 8:02:30 AM4/20/20
to seastar-dev@googlegroups.com, Krzysztof Małysa, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com
A value shared lock allows locking (using shared_mutex) a specified value.
One operation locks only one value, but a value shared lock lets you
maintain locks on different values in one place. Locking is also
"on demand", i.e. the corresponding shared_mutex is not created until a
lock is requested on the value, and it is deleted as soon as the value is
no longer locked by anyone. It serves as a dynamic pool of shared_mutexes
acquired on demand.
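
The on-demand bookkeeping can be sketched with standard-library primitives only. This is an analogy under stated assumptions, not the Seastar implementation: it blocks threads with std::shared_mutex instead of returning futures, and it guards the map with an extra std::mutex, which the single-threaded-per-shard Seastar version does not need. The class name `value_shared_lock_sketch` and the `tracked_values()` accessor are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <utility>

template<class Value>
class value_shared_lock_sketch {
    struct lock_info {
        std::size_t users_num = 0;
        std::shared_mutex lock;
    };
    using map_type = std::map<Value, lock_info>;

    std::mutex _map_guard; // protects _locks and users_num (assumption: multi-threaded use)
    map_type _locks;

    // Creates the entry on first use and counts the new user.
    typename map_type::iterator acquire_entry(const Value& val) {
        std::lock_guard guard(_map_guard);
        auto it = _locks.try_emplace(val).first; // constructs lock_info in place
        ++it->second.users_num;
        return it;
    }

    // Drops one user; the entry disappears together with its last user.
    void release_entry(typename map_type::iterator it) {
        std::lock_guard guard(_map_guard);
        if (--it->second.users_num == 0) {
            _locks.erase(it);
        }
    }

    // Releases the entry on scope exit, also when func throws.
    struct releaser {
        value_shared_lock_sketch& self;
        typename map_type::iterator it;
        ~releaser() { self.release_entry(it); }
    };

public:
    template<class Func>
    void with_shared_on(const Value& val, Func&& func) {
        releaser r{*this, acquire_entry(val)};
        std::shared_lock lk(r.it->second.lock); // unlocked before r is destroyed
        std::forward<Func>(func)();
    }

    template<class Func>
    void with_lock_on(const Value& val, Func&& func) {
        releaser r{*this, acquire_entry(val)};
        std::unique_lock lk(r.it->second.lock);
        std::forward<Func>(func)();
    }

    // Number of values that currently have a mutex allocated.
    std::size_t tracked_values() {
        std::lock_guard guard(_map_guard);
        return _locks.size();
    }
};
```

std::map is used because its iterators stay valid while other entries are inserted or erased, which is what lets each user keep an iterator to its entry for the whole duration of the lock.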

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
src/fs/value_shared_lock.hh | 65 +++++++++++++++++++++++++++++++++++++
CMakeLists.txt | 1 +
2 files changed, 66 insertions(+)
create mode 100644 src/fs/value_shared_lock.hh

diff --git a/src/fs/value_shared_lock.hh b/src/fs/value_shared_lock.hh
new file mode 100644
index 00000000..6c7a3adf
--- /dev/null
+++ b/src/fs/value_shared_lock.hh
@@ -0,0 +1,65 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+
+#include "seastar/core/shared_mutex.hh"
+
+#include <map>
+
+namespace seastar::fs {
+
+template<class Value>
+class value_shared_lock {
+ struct lock_info {
+ size_t users_num = 0;
+ shared_mutex lock;
+ };
+
+ std::map<Value, lock_info> _locks;
+
+public:
+ value_shared_lock() = default;
+
+ template<class Func>
+ auto with_shared_on(const Value& val, Func&& func) {
+ auto it = _locks.emplace(val, lock_info {}).first;
+ ++it->second.users_num;
+ return with_shared(it->second.lock, std::forward<Func>(func)).finally([this, it] {
+ if (--it->second.users_num == 0) {
+ _locks.erase(it);
+ }
+ });
+ }
+
+ template<class Func>
+ auto with_lock_on(const Value& val, Func&& func) {
+ auto it = _locks.emplace(val, lock_info {}).first;
+ ++it->second.users_num;
+ return with_lock(it->second.lock, std::forward<Func>(func)).finally([this, it] {
+ if (--it->second.users_num == 0) {
+ _locks.erase(it);
+ }
+ });
+ }
+};
+
+} // namespace seastar::fs
diff --git a/CMakeLists.txt b/CMakeLists.txt
index fb8fe32c..8a59eca6 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -673,6 +673,7 @@ if (Seastar_EXPERIMENTAL_FS)
src/fs/path.hh
src/fs/range.hh
src/fs/units.hh
+ src/fs/value_shared_lock.hh
)
endif()

--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>
Apr 20, 2020, 8:02:32 AM4/20/20
to seastar-dev@googlegroups.com, Krzysztof Małysa, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com
Creating an unlinked file may be useful as a temporary file, or to expose
the file via a path only after it has been filled with contents.

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
src/fs/metadata_disk_entries.hh | 51 +++++++++++-
src/fs/metadata_log.hh | 6 ++
src/fs/metadata_log_bootstrap.hh | 2 +
.../create_and_open_unlinked_file.hh | 77 +++++++++++++++++++
src/fs/metadata_to_disk_buffer.hh | 5 ++
src/fs/metadata_log.cc | 21 +++++
src/fs/metadata_log_bootstrap.cc | 13 ++++
CMakeLists.txt | 1 +
8 files changed, 175 insertions(+), 1 deletion(-)
create mode 100644 src/fs/metadata_log_operations/create_and_open_unlinked_file.hh

diff --git a/src/fs/metadata_disk_entries.hh b/src/fs/metadata_disk_entries.hh
index 44c2a1c7..437c2c2b 100644
--- a/src/fs/metadata_disk_entries.hh
+++ b/src/fs/metadata_disk_entries.hh
@@ -27,10 +27,52 @@

namespace seastar::fs {

+struct ondisk_unix_metadata {
+ uint32_t perms;
+ uint32_t uid;
+ uint32_t gid;
+ uint64_t btime_ns;
+ uint64_t mtime_ns;
+ uint64_t ctime_ns;
+} __attribute__((packed));
+
+static_assert(sizeof(decltype(ondisk_unix_metadata::perms)) >= sizeof(decltype(unix_metadata::perms)));
+static_assert(sizeof(decltype(ondisk_unix_metadata::uid)) >= sizeof(decltype(unix_metadata::uid)));
+static_assert(sizeof(decltype(ondisk_unix_metadata::gid)) >= sizeof(decltype(unix_metadata::gid)));
+static_assert(sizeof(decltype(ondisk_unix_metadata::btime_ns)) >= sizeof(decltype(unix_metadata::btime_ns)));
+static_assert(sizeof(decltype(ondisk_unix_metadata::mtime_ns)) >= sizeof(decltype(unix_metadata::mtime_ns)));
+static_assert(sizeof(decltype(ondisk_unix_metadata::ctime_ns)) >= sizeof(decltype(unix_metadata::ctime_ns)));
+
+inline unix_metadata ondisk_metadata_to_metadata(const ondisk_unix_metadata& ondisk_metadata) noexcept {
+ unix_metadata res;
+ static_assert(sizeof(ondisk_metadata) == 36,
+ "metadata size changed: check if above static asserts and below assignments need update");
+ res.perms = static_cast<file_permissions>(ondisk_metadata.perms);
+ res.uid = ondisk_metadata.uid;
+ res.gid = ondisk_metadata.gid;
+ res.btime_ns = ondisk_metadata.btime_ns;
+ res.mtime_ns = ondisk_metadata.mtime_ns;
+ res.ctime_ns = ondisk_metadata.ctime_ns;
+ return res;
+}
+
+inline ondisk_unix_metadata metadata_to_ondisk_metadata(const unix_metadata& metadata) noexcept {
+ ondisk_unix_metadata res;
+ static_assert(sizeof(res) == 36, "metadata size changed: check if below assignments need update");
+ res.perms = static_cast<decltype(res.perms)>(metadata.perms);
+ res.uid = metadata.uid;
+ res.gid = metadata.gid;
+ res.btime_ns = metadata.btime_ns;
+ res.mtime_ns = metadata.mtime_ns;
+ res.ctime_ns = metadata.ctime_ns;
+ return res;
+}
+
enum ondisk_type : uint8_t {
INVALID = 0,
CHECKPOINT,
NEXT_METADATA_CLUSTER,
+ CREATE_INODE,
};

struct ondisk_checkpoint {
@@ -54,9 +96,16 @@ struct ondisk_next_metadata_cluster {
cluster_id_t cluster_id; // metadata log continues there
} __attribute__((packed));

+struct ondisk_create_inode {
+ inode_t inode;
+ uint8_t is_directory;
+ ondisk_unix_metadata metadata;
+} __attribute__((packed));
+
template<typename T>
constexpr size_t ondisk_entry_size(const T& entry) noexcept {
- static_assert(std::is_same_v<T, ondisk_next_metadata_cluster>, "ondisk entry size not defined for given type");
+ static_assert(std::is_same_v<T, ondisk_next_metadata_cluster> or
+ std::is_same_v<T, ondisk_create_inode>, "ondisk entry size not defined for given type");
return sizeof(ondisk_type) + sizeof(entry);
}

diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
index c10852a3..6f069c13 100644
--- a/src/fs/metadata_log.hh
+++ b/src/fs/metadata_log.hh
@@ -156,6 +156,8 @@ class metadata_log {

friend class metadata_log_bootstrap;

+ friend class create_and_open_unlinked_file_operation;
+
public:
metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment,
shared_ptr<metadata_to_disk_buffer> cluster_buff);
@@ -176,6 +178,8 @@ class metadata_log {
return _inodes.count(inode) != 0;
}

+ inode_info& memory_only_create_inode(inode_t inode, bool is_directory, unix_metadata metadata);
+
template<class Func>
void schedule_background_task(Func&& task) {
_background_futures = when_all_succeed(_background_futures.get_future(), std::forward<Func>(task));
@@ -286,6 +290,8 @@ class metadata_log {
// Returns size of the file or throws exception iff @p inode is invalid
file_offset_t file_size(inode_t inode) const;

+ future<inode_t> create_and_open_unlinked_file(file_permissions perms);
+
// All disk-related errors will be exposed here
future<> flush_log() {
return flush_curr_cluster();
diff --git a/src/fs/metadata_log_bootstrap.hh b/src/fs/metadata_log_bootstrap.hh
index 5da79631..4a1fa7e9 100644
--- a/src/fs/metadata_log_bootstrap.hh
+++ b/src/fs/metadata_log_bootstrap.hh
@@ -115,6 +115,8 @@ class metadata_log_bootstrap {

bool inode_exists(inode_t inode);

+ future<> bootstrap_create_inode();
+
public:
static future<> bootstrap(metadata_log& metadata_log, inode_t root_dir, cluster_id_t first_metadata_cluster_id,
cluster_range available_clusters, fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id);
diff --git a/src/fs/metadata_log_operations/create_and_open_unlinked_file.hh b/src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
new file mode 100644
index 00000000..79c5e9f2
--- /dev/null
+++ b/src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
@@ -0,0 +1,77 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/metadata_disk_entries.hh"
+#include "fs/metadata_log.hh"
+#include "seastar/core/future.hh"
+
+namespace seastar::fs {
+
+class create_and_open_unlinked_file_operation {
+ metadata_log& _metadata_log;
+
+ create_and_open_unlinked_file_operation(metadata_log& metadata_log) : _metadata_log(metadata_log) {}
+
+ future<inode_t> create_and_open_unlinked_file(file_permissions perms) {
+ using namespace std::chrono;
+ uint64_t curr_time_ns = duration_cast<nanoseconds>(system_clock::now().time_since_epoch()).count();
+ unix_metadata unx_mtdt = {
+ perms,
+ 0, // TODO: Eventually, we'll want a user to be able to pass his credentials when bootstrapping the
+ 0, // file system -- that will allow us to authorize users on startup (e.g. via LDAP or whatnot).
+ curr_time_ns,
+ curr_time_ns,
+ curr_time_ns
+ };
+
+ inode_t new_inode = _metadata_log._inode_allocator.alloc();
+ ondisk_create_inode ondisk_entry {
+ new_inode,
+ false,
+ metadata_to_ondisk_metadata(unx_mtdt)
+ };
+
+ switch (_metadata_log.append_ondisk_entry(ondisk_entry)) {
+ case metadata_log::append_result::TOO_BIG:
+ return make_exception_future<inode_t>(cluster_size_too_small_to_perform_operation_exception());
+ case metadata_log::append_result::NO_SPACE:
+ return make_exception_future<inode_t>(no_more_space_exception());
+ case metadata_log::append_result::APPENDED:
+ inode_info& new_inode_info = _metadata_log.memory_only_create_inode(new_inode, false, unx_mtdt);
+ // We don't have to lock, as there was no context switch since the allocation of the inode number
+ ++new_inode_info.opened_files_count;
+ return make_ready_future<inode_t>(new_inode);
+ }
+ __builtin_unreachable();
+ }
+
+public:
+ static future<inode_t> perform(metadata_log& metadata_log, file_permissions perms) {
+ return do_with(create_and_open_unlinked_file_operation(metadata_log),
+ [perms = std::move(perms)](auto& obj) {
+ return obj.create_and_open_unlinked_file(std::move(perms));
+ });
+ }
+};
+
+} // namespace seastar::fs
diff --git a/src/fs/metadata_to_disk_buffer.hh b/src/fs/metadata_to_disk_buffer.hh
index bd60f4f3..593ad46a 100644
--- a/src/fs/metadata_to_disk_buffer.hh
+++ b/src/fs/metadata_to_disk_buffer.hh
@@ -152,6 +152,11 @@ class metadata_to_disk_buffer : protected to_disk_buffer {
}

public:
+ [[nodiscard]] virtual append_result append(const ondisk_create_inode& create_inode) noexcept {
+ // TODO: maybe add a constexpr static field to each ondisk_* entry specifying what type it is?
+ return append_simple(CREATE_INODE, create_inode);
+ }
+
using to_disk_buffer::flush_to_disk;
};

diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc
index 6e29f2e5..be523fc7 100644
--- a/src/fs/metadata_log.cc
+++ b/src/fs/metadata_log.cc
@@ -26,6 +26,7 @@
#include "fs/metadata_disk_entries.hh"
#include "fs/metadata_log.hh"
#include "fs/metadata_log_bootstrap.hh"
+#include "fs/metadata_log_operations/create_and_open_unlinked_file.hh"
#include "fs/metadata_to_disk_buffer.hh"
#include "fs/path.hh"
#include "fs/units.hh"
@@ -80,6 +81,22 @@ future<> metadata_log::shutdown() {
});
}

+inode_info& metadata_log::memory_only_create_inode(inode_t inode, bool is_directory, unix_metadata metadata) {
+ assert(_inodes.count(inode) == 0);
+ return _inodes.emplace(inode, inode_info {
+ 0,
+ 0,
+ metadata,
+ [&]() -> decltype(inode_info::contents) {
+ if (is_directory) {
+ return inode_info::directory {};
+ }
+
+ return inode_info::file {};
+ }()
+ }).first->second;
+}
+
void metadata_log::schedule_flush_of_curr_cluster() {
// Make writes concurrent (TODO: maybe serialized within *one* cluster would be faster?)
schedule_background_task(do_with(_curr_cluster_buff, &_device, [](auto& crr_clstr_bf, auto& device) {
@@ -213,6 +230,10 @@ file_offset_t metadata_log::file_size(inode_t inode) const {
}, it->second.contents);
}

+future<inode_t> metadata_log::create_and_open_unlinked_file(file_permissions perms) {
+ return create_and_open_unlinked_file_operation::perform(*this, std::move(perms));
+}
+
// TODO: think about how to make filesystem recoverable from ENOSPACE situation: flush() (or something else) throws ENOSPACE,
// then it should be possible to compact some data (e.g. by truncating a file) via top-level interface and retrying the flush()
// without a ENOSPACE error. In particular if we delete all files after ENOSPACE it should be successful. It becomes especially
diff --git a/src/fs/metadata_log_bootstrap.cc b/src/fs/metadata_log_bootstrap.cc
index 926d79fe..702e0e34 100644
--- a/src/fs/metadata_log_bootstrap.cc
+++ b/src/fs/metadata_log_bootstrap.cc
@@ -211,6 +211,8 @@ future<> metadata_log_bootstrap::bootstrap_checkpointed_data() {
return invalid_entry_exception();
case NEXT_METADATA_CLUSTER:
return bootstrap_next_metadata_cluster();
+ case CREATE_INODE:
+ return bootstrap_create_inode();
}

// unknown type => metadata log corruption
@@ -242,6 +244,17 @@ bool metadata_log_bootstrap::inode_exists(inode_t inode) {
return _metadata_log._inodes.count(inode) != 0;
}

+future<> metadata_log_bootstrap::bootstrap_create_inode() {
+ ondisk_create_inode entry;
+ if (not _curr_checkpoint.read_entry(entry) or inode_exists(entry.inode)) {
+ return invalid_entry_exception();
+ }
+
+ _metadata_log.memory_only_create_inode(entry.inode, entry.is_directory,
+ ondisk_metadata_to_metadata(entry.metadata));
+ return now();
+}
+
future<> metadata_log_bootstrap::bootstrap(metadata_log& metadata_log, inode_t root_dir, cluster_id_t first_metadata_cluster_id,
cluster_range available_clusters, fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id) {
// Clear the metadata log
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 19666a8a..3304a02b 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -677,6 +677,7 @@ if (Seastar_EXPERIMENTAL_FS)
src/fs/metadata_log.hh
src/fs/metadata_log_bootstrap.cc
src/fs/metadata_log_bootstrap.hh
+ src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
src/fs/metadata_to_disk_buffer.hh
src/fs/path.hh
src/fs/range.hh
--
2.26.1

Krzysztof Małysa

<varqox@gmail.com>
Apr 20, 2020, 8:02:33 AM4/20/20
to seastar-dev@googlegroups.com, Krzysztof Małysa, sarna@scylladb.com, ankezy@gmail.com, quport@gmail.com, wmitros@protonmail.com
SeastarFS is a log-structured filesystem. Every shard will have 3
private logs:
- metadata log
- medium data log
- big data log (this is not actually a log, but in the big picture it
looks like one)

Disk space is divided into clusters (typically several MiB each) of
equal size, which is a multiple of the alignment (typically 4096
bytes). Each shard has its private pool of clusters (the assignment is
stored in the bootstrap record). Each log consumes clusters one by one -- it
writes to the current one and, when that cluster becomes full, switches to
a new one obtained from the pool of free clusters managed by
cluster_allocator. The metadata log and the medium data log write data in
the same manner: they fill up the cluster gradually from left to right. The
big data log takes a cluster and completely fills it with data at once -- it
is only used during big writes.
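
The left-to-right filling and the switch to a fresh cluster can be sketched as follows. Everything here is illustrative: `sequential_log_sketch` is a hypothetical name, a std::deque stands in for cluster_allocator, and the mapping of a cluster id to the disk offset `id * cluster_size` is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <optional>

using cluster_id_t = std::uint64_t;
using disk_offset_t = std::uint64_t;

struct sequential_log_sketch {
    std::uint64_t cluster_size;             // bytes per cluster
    std::deque<cluster_id_t> free_clusters; // stand-in for cluster_allocator
    cluster_id_t current;                   // cluster currently being filled
    std::uint64_t used = 0;                 // bytes already consumed in `current`

    // Reserves `len` bytes and returns the disk offset where they will go.
    // Returns nullopt if the data can never fit into one cluster or if a new
    // cluster is needed and the free pool is empty.
    std::optional<disk_offset_t> reserve(std::uint64_t len) {
        if (len > cluster_size) {
            return std::nullopt;
        }
        if (used + len > cluster_size) {
            if (free_clusters.empty()) {
                return std::nullopt;
            }
            current = free_clusters.front(); // switch to a fresh cluster
            free_clusters.pop_front();
            used = 0;
        }
        disk_offset_t offset = current * cluster_size + used;
        used += len;
        return offset;
    }
};
```

The two failure cases correspond loosely to the TOO_BIG and NO_SPACE results of append_result that appear elsewhere in this series.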

This commit adds the skeleton of the metadata log:
- data structures for holding metadata in memory with all operations on
this data structure i.e. manipulating files and their contents
- locking logic (detailed description can be found in metadata_log.hh)
- buffers for writing logs to disk (one for metadata and one for medium
data)
- basic higher level interface e.g. path lookup, iterating over
directory
- bootstrapping the metadata log == reading the metadata log from disk and
reconstructing the shard's filesystem structure from just before shutdown

File content is stored as a set of data vectors, each of one of three
kinds: in-memory data, on-disk data, or hole. Small writes are written
directly to the metadata log, and because all metadata is kept in
memory, these writes are also kept in memory, hence the in-memory kind.
Medium and large data are not stored in memory, so they are represented
using the on-disk kind. Enlarging a file via truncate may produce holes,
hence the hole kind.
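
One way to picture the three kinds is a tagged union. The sketch below
is an illustration with hypothetical names (data_vector, kind_name,
make_hole_for_truncate); the actual definitions are in
src/fs/inode_info.hh and may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <variant>

// Hypothetical payloads for the three kinds of data vector.
struct in_memory_data { std::string bytes; };     // small write, kept with the metadata
struct on_disk_data { uint64_t disk_offset; };    // medium/big write, only its location is kept
struct hole_data {};                              // no data at all

// A contiguous range of a file together with where its content lives.
struct data_vector {
    uint64_t file_offset;
    uint64_t length;
    std::variant<in_memory_data, on_disk_data, hole_data> kind;
};

// Names the kind of a data vector by visiting the variant.
inline const char* kind_name(const data_vector& v) {
    struct visitor {
        const char* operator()(const in_memory_data&) const { return "in-memory"; }
        const char* operator()(const on_disk_data&) const { return "on-disk"; }
        const char* operator()(const hole_data&) const { return "hole"; }
    };
    return std::visit(visitor{}, v.kind);
}

// Enlarging a file via truncate: the new range [old_size, new_size)
// has no content, so it is represented as a hole.
inline data_vector make_hole_for_truncate(uint64_t old_size, uint64_t new_size) {
    assert(new_size > old_size);
    return data_vector{old_size, new_size - old_size, hole_data{}};
}
```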

Directory entries are stored as metadata log entries -- directory inodes
have no content.

To-disk buffers hold data that will be written to disk. There are two
kinds: the (normal) to-disk buffer and the metadata to-disk buffer. The
latter is implemented using the former, but provides a higher-level
interface for appending metadata log entries rather than raw bytes.

The normal to-disk buffer appends data sequentially, but when a flush
occurs, the offset at which the next data will be appended is aligned up
to the alignment, to ensure that writes to the same cluster are
non-overlapping.
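
The effect of aligning on flush can be modelled with a few lines. This
is a sketch under the assumption that the alignment is a power of two;
append_buffer_model and written_range are hypothetical names that only
track offsets, not the actual to_disk_buffer class.

```cpp
#include <cassert>
#include <cstddef>

// Rounds n up to the nearest multiple of alignment (a power of two).
constexpr size_t align_up(size_t n, size_t alignment) {
    return (n + alignment - 1) & ~(alignment - 1);
}

// Half-open byte range [begin, end) handed to the disk by one flush.
struct written_range {
    size_t begin;
    size_t end;
};

// Hypothetical model of the buffer's offsets; no real I/O happens here.
class append_buffer_model {
    size_t _alignment;
    size_t _next = 0;            // where the next append lands
    size_t _unflushed_begin = 0; // start of data not yet flushed
public:
    explicit append_buffer_model(size_t alignment) : _alignment(alignment) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    }

    // Appends len bytes and returns the offset at which they were placed.
    size_t append(size_t len) {
        size_t pos = _next;
        _next += len;
        return pos;
    }

    // Flushes the unflushed data. The write ends on an aligned offset and
    // the next append starts there, so consecutive writes never overlap.
    written_range flush() {
        written_range w{_unflushed_begin, align_up(_next, _alignment)};
        _next = w.end;
        _unflushed_begin = w.end;
        return w;
    }
};
```

The cost of this scheme is the padding between the last appended byte
and the next aligned offset, which is lost on every flush.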

The metadata to-disk buffer appends data using the normal to-disk buffer
but does some formatting along the way. The structure of the metadata
log on disk is as follows:
| checkpoint_1 | entry_1, entry_2, ..., entry_n | checkpoint_2 | ... |
               | <---- checkpointed data -----> |
etc. Every batch of metadata log entries is preceded by a checkpoint
entry. Appending a metadata log entry appends it to the current batch of
entries. A flush or a lack of space ends the current batch: the
checkpoint entry is updated (because it holds the CRC of all
checkpointed data), a write of the whole batch is requested, and a new
checkpoint is started (if there is space for it). The last checkpoint
in a cluster contains a special entry pointing to the next cluster that
is used by the metadata log.
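
A rough sketch of sealing a batch behind a checkpoint follows. The
header layout (checkpoint_header), the choice of CRC-32 with the
reflected polynomial 0xEDB88320, and the batch_writer class are all
assumptions made for illustration; the real entry layout is defined in
metadata_disk_entries.hh. For brevity the sketch emits the header when
the batch is sealed instead of reserving it up front and updating it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Bitwise CRC-32 (reflected polynomial 0xEDB88320); assumed here for illustration.
inline uint32_t crc32(const uint8_t* data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Hypothetical checkpoint layout: CRC and length of the entries that follow.
struct checkpoint_header {
    uint32_t crc;
    uint32_t length;
};

// Hypothetical writer that collects entries and seals them on flush.
class batch_writer {
    std::vector<uint8_t> _disk;  // bytes already handed to the disk
    std::vector<uint8_t> _batch; // entries of the current, unsealed batch
public:
    // Appends one serialized entry to the current batch.
    void append_entry(const std::string& entry) {
        _batch.insert(_batch.end(), entry.begin(), entry.end());
    }

    // Ends the current batch: writes the checkpoint, then the entries it covers.
    void flush() {
        if (_batch.empty()) {
            return;
        }
        checkpoint_header h{crc32(_batch.data(), _batch.size()),
                            static_cast<uint32_t>(_batch.size())};
        const auto* p = reinterpret_cast<const uint8_t*>(&h);
        _disk.insert(_disk.end(), p, p + sizeof(h));
        _disk.insert(_disk.end(), _batch.begin(), _batch.end());
        _batch.clear();
    }

    const std::vector<uint8_t>& disk() const { return _disk; }
};
```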

Bootstrapping is, in fact, just replaying all the actions from the
metadata log that were saved on disk. It works as follows:
- reads metadata log clusters one by one
- for each cluster, processes every checkpoint and the entries it
  checkpoints; the last checkpoint in a cluster contains the pointer to
  the next cluster to read
- processing works as follows:
  - the checkpoint entry is read; if it is invalid, the metadata log
    ends here (the last checkpoint was partially written, or the
    metadata log really ended here, or there was some data
    corruption...) and we stop
  - if it is valid, it contains the length of the checkpointed data
    (metadata log entries), so we process all of them (an error there
    indicates data corruption that the CRC somehow failed to catch, so
    we abort the whole bootstrap with an error)
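
The replay loop above can be sketched as follows. Everything here is a
simplification invented for illustration: entries are one-byte opcodes
('C' create, 'U' unlink, 'N' followed by the index of the next cluster),
the checksum is FNV-1a standing in for the real CRC, and replay() and
append_batch() are hypothetical helpers, not functions from
metadata_log_bootstrap.cc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <stdexcept>
#include <vector>

using bytes = std::vector<uint8_t>;

// FNV-1a, used only as a stand-in for the CRC of the checkpointed data.
inline uint32_t checksum(const uint8_t* p, size_t n) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; ++i) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

// Hypothetical checkpoint layout: checksum and length of the entries behind it.
struct checkpoint_header {
    uint32_t sum;
    uint32_t length;
};

// Helper for building input: appends a checkpoint and its entries to a cluster.
inline void append_batch(bytes& cluster, const bytes& entries) {
    checkpoint_header h{checksum(entries.data(), entries.size()),
                        static_cast<uint32_t>(entries.size())};
    const auto* p = reinterpret_cast<const uint8_t*>(&h);
    cluster.insert(cluster.end(), p, p + sizeof(h));
    cluster.insert(cluster.end(), entries.begin(), entries.end());
}

// Replays the log starting at clusters[first]; returns the number of live files.
inline int replay(const std::vector<bytes>& clusters, size_t first) {
    int files = 0;
    std::optional<size_t> current = first;
    while (current) {
        const bytes& c = clusters.at(*current);
        current.reset(); // stays empty unless a next-cluster entry is seen
        size_t pos = 0;
        while (true) {
            checkpoint_header h;
            if (c.size() - pos < sizeof(h)) {
                break; // no room for a checkpoint: the log ends here
            }
            std::memcpy(&h, c.data() + pos, sizeof(h));
            pos += sizeof(h);
            if (h.length > c.size() - pos || checksum(c.data() + pos, h.length) != h.sum) {
                break; // invalid checkpoint: the log ends here
            }
            const size_t end = pos + h.length;
            while (pos < end) {
                switch (c[pos++]) {
                case 'C': ++files; break;
                case 'U': --files; break;
                case 'N':
                    if (pos == end) {
                        throw std::runtime_error("truncated next-cluster entry");
                    }
                    current = c[pos++];
                    break;
                default:
                    // The checksum matched but the entry is garbage: abort.
                    throw std::runtime_error("corrupted entry behind a valid checksum");
                }
            }
            if (current) {
                break; // last checkpoint of this cluster, continue in the next one
            }
        }
    }
    return files;
}
```

Note the asymmetry the text describes: a bad checkpoint quietly ends the
replay, while a bad entry behind a valid checkpoint is reported as an
error.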

Locking ensures that concurrent modifications of the metadata do not
corrupt it. For example, creating a file is a complex operation: you
have to create an inode, add a directory entry that will represent this
inode with a path, and write the corresponding metadata log entries to
disk. Simultaneous attempts to create the same file could corrupt the
file system. Not to mention concurrent create and unlink on the same
path... Thus a careful and robust locking mechanism is used. For
details see metadata_log.hh.

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
include/seastar/fs/exceptions.hh | 88 +++++++++
src/fs/inode_info.hh | 221 ++++++++++++++++++++++
src/fs/metadata_disk_entries.hh | 63 +++++++
src/fs/metadata_log.hh | 295 ++++++++++++++++++++++++++++++
src/fs/metadata_log_bootstrap.hh | 123 +++++++++++++
src/fs/metadata_to_disk_buffer.hh | 158 ++++++++++++++++
src/fs/to_disk_buffer.hh | 138 ++++++++++++++
src/fs/unix_metadata.hh | 40 ++++
src/fs/metadata_log.cc | 222 ++++++++++++++++++++++
src/fs/metadata_log_bootstrap.cc | 264 ++++++++++++++++++++++++++
CMakeLists.txt | 10 +
11 files changed, 1622 insertions(+)
create mode 100644 include/seastar/fs/exceptions.hh
create mode 100644 src/fs/inode_info.hh
create mode 100644 src/fs/metadata_disk_entries.hh
create mode 100644 src/fs/metadata_log.hh
create mode 100644 src/fs/metadata_log_bootstrap.hh
create mode 100644 src/fs/metadata_to_disk_buffer.hh
create mode 100644 src/fs/to_disk_buffer.hh
create mode 100644 src/fs/unix_metadata.hh
create mode 100644 src/fs/metadata_log.cc
create mode 100644 src/fs/metadata_log_bootstrap.cc

diff --git a/include/seastar/fs/exceptions.hh b/include/seastar/fs/exceptions.hh
new file mode 100644
index 00000000..9941f557
--- /dev/null
+++ b/include/seastar/fs/exceptions.hh
@@ -0,0 +1,88 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+
+#include <exception>
+
+namespace seastar::fs {
+
+struct fs_exception : public std::exception {
+    const char* what() const noexcept override = 0;
+};
+
+struct cluster_size_too_small_to_perform_operation_exception : public std::exception {
+    const char* what() const noexcept override { return "Cluster size is too small to perform operation"; }
+};
+
+struct invalid_inode_exception : public fs_exception {
+    const char* what() const noexcept override { return "Invalid inode"; }
+};
+
+struct invalid_argument_exception : public fs_exception {
+    const char* what() const noexcept override { return "Invalid argument"; }
+};
+
+struct operation_became_invalid_exception : public fs_exception {
+    const char* what() const noexcept override { return "Operation became invalid"; }
+};
+
+struct no_more_space_exception : public fs_exception {
+    const char* what() const noexcept override { return "No more space on device"; }
+};
+
+struct file_already_exists_exception : public fs_exception {
+    const char* what() const noexcept override { return "File already exists"; }
+};
+
+struct filename_too_long_exception : public fs_exception {
+    const char* what() const noexcept override { return "Filename too long"; }
+};
+
+struct is_directory_exception : public fs_exception {
+    const char* what() const noexcept override { return "Is a directory"; }
+};
+
+struct directory_not_empty_exception : public fs_exception {
+    const char* what() const noexcept override { return "Directory is not empty"; }
+};
+
+struct path_lookup_exception : public fs_exception {
+    const char* what() const noexcept override = 0;
+};
+
+struct path_is_not_absolute_exception : public path_lookup_exception {
+    const char* what() const noexcept override { return "Path is not absolute"; }
+};
+
+struct invalid_path_exception : public path_lookup_exception {
+    const char* what() const noexcept override { return "Path is invalid"; }
+};
+
+struct no_such_file_or_directory_exception : public path_lookup_exception {
+    const char* what() const noexcept override { return "No such file or directory"; }
+};
+
+struct path_component_not_directory_exception : public path_lookup_exception {
+    const char* what() const noexcept override { return "A component used as a directory is not a directory"; }
+};
+
+} // namespace seastar::fs
diff --git a/src/fs/inode_info.hh b/src/fs/inode_info.hh
new file mode 100644
index 00000000..89bc71d8
--- /dev/null
+++ b/src/fs/inode_info.hh
@@ -0,0 +1,221 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/inode.hh"
+#include "fs/units.hh"
+#include "fs/unix_metadata.hh"
+#include "seastar/core/temporary_buffer.hh"
+#include "seastar/fs/