[RFC PATCH 01/34] fs: prepare fs/ directory and conditional compilation

Krzysztof Małysa

Apr 20, 2020, 8:02:18 AM4/20/20
to seast...@googlegroups.com, Piotr Sarna, ank...@gmail.com, qup...@gmail.com, wmi...@protonmail.com
From: Piotr Sarna <sa...@scylladb.com>

This patch provides the initial infrastructure for future
SeastarFS (Seastar filesystem) patches.
Since the project is at a very early stage and will require
C++17 features, it is disabled by default and can only be enabled
manually by configuring with --enable-experimental-fs
or by defining the CMake flag -DSeastar_EXPERIMENTAL_FS=ON.
---
configure.py | 6 ++++++
CMakeLists.txt | 12 ++++++++++++
src/fs/README.md | 10 ++++++++++
3 files changed, 28 insertions(+)
create mode 100644 src/fs/README.md

diff --git a/configure.py b/configure.py
index bbc9f908..6ec350ef 100755
--- a/configure.py
+++ b/configure.py
@@ -106,6 +106,11 @@ add_tristate(
name = 'unused-result-error',
dest = "unused_result_error",
help = 'Make [[nodiscard]] violations an error')
+add_tristate(
+ arg_parser,
+ name = 'experimental-fs',
+ dest = "experimental_fs",
+ help = 'experimental support for SeastarFS')
arg_parser.add_argument('--allocator-page-size', dest='alloc_page_size', type=int, help='override allocator page size')
arg_parser.add_argument('--without-tests', dest='exclude_tests', action='store_true', help='Do not build tests by default')
arg_parser.add_argument('--without-apps', dest='exclude_apps', action='store_true', help='Do not build applications by default')
@@ -201,6 +206,7 @@ def configure_mode(mode):
tr(args.heap_profiling, 'HEAP_PROFILING'),
tr(args.coroutines_ts, 'EXPERIMENTAL_COROUTINES_TS'),
tr(args.unused_result_error, 'UNUSED_RESULT_ERROR'),
+ tr(args.experimental_fs, 'EXPERIMENTAL_FS'),
]

ingredients_to_cook = set(args.cook)
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 39ae46aa..be4f02c8 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -281,6 +281,10 @@ set (Seastar_STACK_GUARDS
STRING
"Enable stack guards. Can be ON, OFF or DEFAULT (which enables it for non release builds)")

+option (Seastar_EXPERIMENTAL_FS
+ "Compile experimental SeastarFS sources (requires C++17 support)"
+ OFF)
+
# When Seastar is embedded with `add_subdirectory`, disable the non-library targets.
if (NOT (CMAKE_CURRENT_SOURCE_DIR STREQUAL CMAKE_SOURCE_DIR))
set (Seastar_APPS OFF)
@@ -648,6 +652,14 @@ add_library (seastar STATIC

add_library (Seastar::seastar ALIAS seastar)

+if (Seastar_EXPERIMENTAL_FS)
+ message(STATUS "Experimental SeastarFS is enabled")
+ target_sources(seastar
+ PRIVATE
+ # SeastarFS source files
+ )
+endif()
+
add_dependencies (seastar
seastar_http_request_parser
seastar_http_response_parser
diff --git a/src/fs/README.md b/src/fs/README.md
new file mode 100644
index 00000000..630f34a8
--- /dev/null
+++ b/src/fs/README.md
@@ -0,0 +1,10 @@
+### SeastarFS ###
+
+SeastarFS is an R&D project aimed at providing a fully asynchronous,
+log-structured, shard-friendly file system optimized for large files
+and with native Seastar support.
+
+Source files residing in this directory are compiled only when an
+appropriate flag is set:
+- ninja: `./configure.py --enable-experimental-fs`
+- CMake: `-DSeastar_EXPERIMENTAL_FS=ON`
--
2.26.1

Krzysztof Małysa

Apr 20, 2020, 8:02:18 AM4/20/20
to seast...@googlegroups.com, Krzysztof Małysa, sa...@scylladb.com, ank...@gmail.com, qup...@gmail.com, wmi...@protonmail.com
github: https://github.com/psarna/seastar/commits/fs-metadata-log

This series is part of the ZPP FS project that is coordinated by Piotr Sarna <sa...@scylladb.com>.
The goal of this project is to create SeastarFS -- a fully asynchronous, sharded, user-space,
log-structured file system that is intended to become an alternative to XFS for Scylla.

The filesystem is optimized for:
- NVMe SSD storage
- large files
- appending files

For efficiency, all metadata is stored in memory. The metadata holds information about where each
part of a file is located and about the directory tree.
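The in-memory metadata described above can be pictured roughly as follows. This is a hypothetical sketch for illustration only: the type and field names (`inode_info`, `on_disk_extent`, etc.) are guesses at the shape of the real `src/fs/inode_info.hh`, not its actual contents.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <variant>

using inode_t = uint64_t;

// A piece of file content living somewhere in a cluster on disk.
struct on_disk_extent {
    uint64_t disk_offset;
    uint64_t length;
};

// A directory maps entry names to inodes.
struct directory {
    std::map<std::string, inode_t> entries;
};

// A file maps file offsets to the locations of its data.
struct file {
    std::map<uint64_t, on_disk_extent> extents;
};

// Per-inode metadata: an inode is either a directory or a file.
struct inode_info {
    std::variant<directory, file> contents;
};
```

With such a structure, resolving a path or locating a byte range of a file is a series of in-memory map lookups, which is what makes keeping all metadata in memory attractive.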

The whole filesystem is divided into filesystem shards (for efficiency, typically as many as there
are Seastar shards). Each shard holds its own part of the filesystem: every shard exclusively owns
its set of root subdirectories, and one of the shards additionally owns the root directory itself.
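The ownership scheme above can be sketched as a pure function from a path to a shard. This is illustrative only: the hash-based assignment below is a stand-in, since the real assignment is whatever the bootstrap record distributes among shards.

```cpp
#include <cstddef>
#include <functional>
#include <string_view>

// Which shard owns a given path? Ownership follows the path's first
// component (a root subdirectory); one shard also owns "/" itself.
size_t owning_shard(std::string_view path, size_t shard_count) {
    if (path == "/") {
        return 0;  // a designated shard owns the root directory itself
    }
    // Extract the first path component, e.g. "/a/b/c" -> "a".
    auto start = path.find_first_not_of('/');
    auto end = path.find('/', start);
    std::string_view top = path.substr(start, end - start);
    // Illustrative stand-in for the bootstrap record's assignment.
    return std::hash<std::string_view>{}(top) % shard_count;
}
```

The key property is that every path under the same root subdirectory lands on the same shard, so a shard can operate on its subtree without cross-shard coordination.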

Every shard will have 3 private logs:
- metadata log -- holds metadata and very small writes
- medium data log -- holds medium-sized writes, which can combine data from different files
- big data log -- holds data clusters, each of which belongs to a single file (this is not
actually a log, but in the big picture it behaves like one)

Disk space is divided into clusters (typically several MiB each) that are all of equal size,
which is a multiple of the alignment (typically 4096 bytes). Each shard has its private pool of
clusters (the assignment is stored in the bootstrap record). Each log consumes clusters one by
one: it writes to the current one, and when the cluster becomes full, the log switches to a new
one obtained from the pool of free clusters managed by the cluster_allocator. The metadata log
and the medium data log write data in the same manner: they fill up the cluster gradually from
left to right. The big data log takes a cluster and completely fills it with data at once -- it
is used only during big writes.
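The cluster_allocator's role described above amounts to handing out clusters from a per-shard free pool and taking them back. A minimal sketch, assuming a simple free-list design; the interface names below are guesses based on this description, not the actual code in src/fs/cluster_allocator.hh:

```cpp
#include <cstdint>
#include <deque>
#include <optional>
#include <utility>

using cluster_id_t = uint64_t;

// Per-shard pool of free clusters: logs call alloc() when their current
// cluster fills up, and free() returns clusters to the pool.
class cluster_allocator {
    std::deque<cluster_id_t> _free_clusters;
public:
    explicit cluster_allocator(std::deque<cluster_id_t> free_clusters)
        : _free_clusters(std::move(free_clusters)) {}

    // Returns a free cluster, or nullopt when the shard's pool is exhausted.
    std::optional<cluster_id_t> alloc() {
        if (_free_clusters.empty()) {
            return std::nullopt;
        }
        cluster_id_t id = _free_clusters.front();
        _free_clusters.pop_front();
        return id;
    }

    // Returns a cluster to the pool (e.g. once compaction frees it).
    void free(cluster_id_t id) {
        _free_clusters.push_back(id);
    }
};
```

Because each shard owns a disjoint pool (recorded in the bootstrap record), allocation needs no cross-shard synchronization.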

metadata_log is in fact a standalone file system instance that provides the lower-level interface
(paths and inodes) to the shard's own part of the filesystem. It manages all 3 logs mentioned above
and maintains all metadata about its part of the file system, including the data structures for the
directory structure and file content, the locking logic for safe concurrent usage, the buffers for
writing logs to disk, and bootstrapping -- restoring the file system structure from disk.

This patch series implements:
- bootstrap record -- our equivalent of the filesystem superblock -- it contains information such
  as the size of the block device, the number of filesystem shards, and the cluster distribution
  among shards
- cluster allocator for managing free clusters within one metadata_log
- fully functional metadata_log that constitutes one shard's part of the filesystem
- bootstrapping the metadata_log
- creating/deleting files and directories (+ support for unlinked files)
- reading, writing and truncating files
- opening and closing files
- linking files (but not directories)
- iterating over a directory and getting file attributes
- tests of some components and functionality of the metadata_log and the bootstrap record

What is not here yet, but we plan to push later:
- compaction
- filesystem sharding
- renaming files

Tests: unit(dev)

Aleksander Sorokin (3):
fs: add initial file implementation
tests: fs: add parallel i/o unit test for seastarfs file
tests: fs: add basic test for metadata log bootstrapping

Krzysztof Małysa (14):
fs: add initial block_device implementation
fs: add temporary_file
tests: fs: add block_device unit test
fs: add unit headers
fs: add seastar/fs/overloaded.hh
fs: add seastar/fs/path.hh with unit tests
fs: add value_shared_lock.hh
fs: metadata_log: add base implementation
fs: metadata_log: add operation for creating and opening unlinked file
fs: metadata_log: add creating files and directories
fs: metadata_log: add private operation for deleting inode
fs: metadata_log: add link operation
fs: metadata_log: add unlinking files and removing directories
fs: metadata_log: add stat() operation

Michał Niciejewski (10):
fs: add bootstrap record implementation
tests: fs: add tests for bootstrap record
fs: metadata_log: add opening file
fs: metadata_log: add closing file
fs: metadata_log: add write operation
fs: metadata_log: add read operation
tests: fs: added metadata_to_disk_buffer and cluster_writer mockers
tests: fs: add write test
fs: read: add optimization for aligned reads
tests: fs: add tests for aligned reads and writes

Piotr Sarna (1):
fs: prepare fs/ directory and conditional compilation

Wojciech Mitros (6):
fs: add cluster allocator
fs: add cluster allocator tests
fs: metadata_log: add truncate operation
tests: fs: add to_disk_buffer test
tests: fs: add truncate operation test
tests: fs: add metadata_to_disk_buffer unit tests

configure.py | 6 +
include/seastar/fs/block_device.hh | 102 +++
include/seastar/fs/exceptions.hh | 88 ++
include/seastar/fs/file.hh | 55 ++
include/seastar/fs/overloaded.hh | 26 +
include/seastar/fs/stat.hh | 41 +
include/seastar/fs/temporary_file.hh | 54 ++
src/fs/bitwise.hh | 125 +++
src/fs/bootstrap_record.hh | 98 ++
src/fs/cluster.hh | 42 +
src/fs/cluster_allocator.hh | 50 ++
src/fs/cluster_writer.hh | 85 ++
src/fs/crc.hh | 34 +
src/fs/device_reader.hh | 91 ++
src/fs/inode.hh | 80 ++
src/fs/inode_info.hh | 221 +++++
src/fs/metadata_disk_entries.hh | 208 +++++
src/fs/metadata_log.hh | 362 ++++++++
src/fs/metadata_log_bootstrap.hh | 145 +++
.../create_and_open_unlinked_file.hh | 77 ++
src/fs/metadata_log_operations/create_file.hh | 174 ++++
src/fs/metadata_log_operations/link_file.hh | 112 +++
src/fs/metadata_log_operations/read.hh | 138 +++
src/fs/metadata_log_operations/truncate.hh | 90 ++
.../unlink_or_remove_file.hh | 196 ++++
src/fs/metadata_log_operations/write.hh | 318 +++++++
src/fs/metadata_to_disk_buffer.hh | 244 +++++
src/fs/path.hh | 42 +
src/fs/range.hh | 61 ++
src/fs/to_disk_buffer.hh | 138 +++
src/fs/units.hh | 40 +
src/fs/unix_metadata.hh | 40 +
src/fs/value_shared_lock.hh | 65 ++
tests/unit/fs_metadata_common.hh | 467 ++++++++++
tests/unit/fs_mock_block_device.hh | 55 ++
tests/unit/fs_mock_cluster_writer.hh | 78 ++
tests/unit/fs_mock_metadata_to_disk_buffer.hh | 323 +++++++
src/fs/bootstrap_record.cc | 206 +++++
src/fs/cluster_allocator.cc | 54 ++
src/fs/device_reader.cc | 199 +++++
src/fs/file.cc | 108 +++
src/fs/metadata_log.cc | 525 +++++++++++
src/fs/metadata_log_bootstrap.cc | 552 ++++++++++++
tests/unit/fs_block_device_test.cc | 206 +++++
tests/unit/fs_bootstrap_record_test.cc | 414 +++++++++
tests/unit/fs_cluster_allocator_test.cc | 115 +++
tests/unit/fs_log_bootstrap_test.cc | 86 ++
tests/unit/fs_metadata_to_disk_buffer_test.cc | 462 ++++++++++
tests/unit/fs_mock_block_device.cc | 50 ++
.../fs_mock_metadata_to_disk_buffer_test.cc | 357 ++++++++
tests/unit/fs_path_test.cc | 90 ++
tests/unit/fs_seastarfs_test.cc | 62 ++
tests/unit/fs_to_disk_buffer_test.cc | 160 ++++
tests/unit/fs_truncate_test.cc | 171 ++++
tests/unit/fs_write_test.cc | 835 ++++++++++++++++++
CMakeLists.txt | 50 ++
src/fs/README.md | 10 +
tests/unit/CMakeLists.txt | 42 +
58 files changed, 9325 insertions(+)
create mode 100644 include/seastar/fs/block_device.hh
create mode 100644 include/seastar/fs/exceptions.hh
create mode 100644 include/seastar/fs/file.hh
create mode 100644 include/seastar/fs/overloaded.hh
create mode 100644 include/seastar/fs/stat.hh
create mode 100644 include/seastar/fs/temporary_file.hh
create mode 100644 src/fs/bitwise.hh
create mode 100644 src/fs/bootstrap_record.hh
create mode 100644 src/fs/cluster.hh
create mode 100644 src/fs/cluster_allocator.hh
create mode 100644 src/fs/cluster_writer.hh
create mode 100644 src/fs/crc.hh
create mode 100644 src/fs/device_reader.hh
create mode 100644 src/fs/inode.hh
create mode 100644 src/fs/inode_info.hh
create mode 100644 src/fs/metadata_disk_entries.hh
create mode 100644 src/fs/metadata_log.hh
create mode 100644 src/fs/metadata_log_bootstrap.hh
create mode 100644 src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
create mode 100644 src/fs/metadata_log_operations/create_file.hh
create mode 100644 src/fs/metadata_log_operations/link_file.hh
create mode 100644 src/fs/metadata_log_operations/read.hh
create mode 100644 src/fs/metadata_log_operations/truncate.hh
create mode 100644 src/fs/metadata_log_operations/unlink_or_remove_file.hh
create mode 100644 src/fs/metadata_log_operations/write.hh
create mode 100644 src/fs/metadata_to_disk_buffer.hh
create mode 100644 src/fs/path.hh
create mode 100644 src/fs/range.hh
create mode 100644 src/fs/to_disk_buffer.hh
create mode 100644 src/fs/units.hh
create mode 100644 src/fs/unix_metadata.hh
create mode 100644 src/fs/value_shared_lock.hh
create mode 100644 tests/unit/fs_metadata_common.hh
create mode 100644 tests/unit/fs_mock_block_device.hh
create mode 100644 tests/unit/fs_mock_cluster_writer.hh
create mode 100644 tests/unit/fs_mock_metadata_to_disk_buffer.hh
create mode 100644 src/fs/bootstrap_record.cc
create mode 100644 src/fs/cluster_allocator.cc
create mode 100644 src/fs/device_reader.cc
create mode 100644 src/fs/file.cc
create mode 100644 src/fs/metadata_log.cc
create mode 100644 src/fs/metadata_log_bootstrap.cc
create mode 100644 tests/unit/fs_block_device_test.cc
create mode 100644 tests/unit/fs_bootstrap_record_test.cc
create mode 100644 tests/unit/fs_cluster_allocator_test.cc
create mode 100644 tests/unit/fs_log_bootstrap_test.cc
create mode 100644 tests/unit/fs_metadata_to_disk_buffer_test.cc
create mode 100644 tests/unit/fs_mock_block_device.cc
create mode 100644 tests/unit/fs_mock_metadata_to_disk_buffer_test.cc
create mode 100644 tests/unit/fs_path_test.cc
create mode 100644 tests/unit/fs_seastarfs_test.cc
create mode 100644 tests/unit/fs_to_disk_buffer_test.cc
create mode 100644 tests/unit/fs_truncate_test.cc
create mode 100644 tests/unit/fs_write_test.cc
create mode 100644 src/fs/README.md

--
2.26.1

Krzysztof Małysa

Apr 20, 2020, 8:02:19 AM4/20/20
to seast...@googlegroups.com, Krzysztof Małysa, sa...@scylladb.com, ank...@gmail.com, qup...@gmail.com, wmi...@protonmail.com
block_device is an abstraction over an opened block device file or an
opened ordinary file of fixed size. It offers:
- opening and closing the file (block device)
- aligned reads and writes
- flushing
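The read/write contract above requires the position, buffer and length to all be aligned (typically to 4096 bytes). A caller starting from an arbitrary byte range has to widen it to aligned bounds first; a minimal sketch of that arithmetic, with helper names that are illustrative rather than part of the patch:

```cpp
#include <cstdint>

// Round an offset down to the nearest multiple of the alignment
// (start of the aligned range containing it).
constexpr uint64_t round_down_to_alignment(uint64_t x, uint64_t alignment) {
    return x - x % alignment;
}

// Round an offset up to the nearest multiple of the alignment
// (end of the aligned range containing it).
constexpr uint64_t round_up_to_alignment(uint64_t x, uint64_t alignment) {
    return round_down_to_alignment(x + alignment - 1, alignment);
}
```

An unaligned range [a, b) is then read as [round_down_to_alignment(a), round_up_to_alignment(b)) into an aligned buffer, and the caller slices out the bytes it wanted.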

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
include/seastar/fs/block_device.hh | 102 +++++++++++++++++++++++++++++
CMakeLists.txt | 1 +
2 files changed, 103 insertions(+)
create mode 100644 include/seastar/fs/block_device.hh

diff --git a/include/seastar/fs/block_device.hh b/include/seastar/fs/block_device.hh
new file mode 100644
index 00000000..31822617
--- /dev/null
+++ b/include/seastar/fs/block_device.hh
@@ -0,0 +1,102 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include "seastar/core/file.hh"
+#include "seastar/core/reactor.hh"
+
+namespace seastar::fs {
+
+class block_device_impl {
+public:
+ virtual ~block_device_impl() = default;
+
+ virtual future<size_t> write(uint64_t aligned_pos, const void* aligned_buffer, size_t aligned_len, const io_priority_class& pc) = 0;
+ virtual future<size_t> read(uint64_t aligned_pos, void* aligned_buffer, size_t aligned_len, const io_priority_class& pc) = 0;
+ virtual future<> flush() = 0;
+ virtual future<> close() = 0;
+};
+
+class block_device {
+ shared_ptr<block_device_impl> _block_device_impl;
+public:
+ block_device(shared_ptr<block_device_impl> impl) noexcept : _block_device_impl(std::move(impl)) {}
+
+ block_device() = default;
+
+ block_device(const block_device&) = default;
+ block_device(block_device&&) noexcept = default;
+ block_device& operator=(const block_device&) noexcept = default;
+ block_device& operator=(block_device&&) noexcept = default;
+
+ explicit operator bool() const noexcept { return bool(_block_device_impl); }
+
+ template <typename CharType>
+ future<size_t> read(uint64_t aligned_offset, CharType* aligned_buffer, size_t aligned_len, const io_priority_class& pc = default_priority_class()) {
+ return _block_device_impl->read(aligned_offset, aligned_buffer, aligned_len, pc);
+ }
+
+ template <typename CharType>
+ future<size_t> write(uint64_t aligned_offset, const CharType* aligned_buffer, size_t aligned_len, const io_priority_class& pc = default_priority_class()) {
+ return _block_device_impl->write(aligned_offset, aligned_buffer, aligned_len, pc);
+ }
+
+ future<> flush() {
+ return _block_device_impl->flush();
+ }
+
+ future<> close() {
+ return _block_device_impl->close();
+ }
+};
+
+class file_block_device_impl : public block_device_impl {
+ file _file;
+public:
+ explicit file_block_device_impl(file f) : _file(std::move(f)) {}
+
+ ~file_block_device_impl() override = default;
+
+ future<size_t> write(uint64_t aligned_pos, const void* aligned_buffer, size_t aligned_len, const io_priority_class& pc) override {
+ return _file.dma_write(aligned_pos, aligned_buffer, aligned_len, pc);
+ }
+
+ future<size_t> read(uint64_t aligned_pos, void* aligned_buffer, size_t aligned_len, const io_priority_class& pc) override {
+ return _file.dma_read(aligned_pos, aligned_buffer, aligned_len, pc);
+ }
+
+ future<> flush() override {
+ return _file.flush();
+ }
+
+ future<> close() override {
+ return _file.close();
+ }
+};
+
+inline future<block_device> open_block_device(std::string name) {
+ return open_file_dma(std::move(name), open_flags::rw).then([](file f) {
+ return block_device(make_shared<file_block_device_impl>(std::move(f)));
+ });
+}
+
+}
diff --git a/CMakeLists.txt b/CMakeLists.txt
index be4f02c8..b50abf99 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -657,6 +657,7 @@ if (Seastar_EXPERIMENTAL_FS)
target_sources(seastar
PRIVATE
# SeastarFS source files
+ include/seastar/fs/block_device.hh
)
endif()

--
2.26.1

Krzysztof Małysa

Apr 20, 2020, 8:02:21 AM4/20/20
to seast...@googlegroups.com, Krzysztof Małysa, sa...@scylladb.com, ank...@gmail.com, qup...@gmail.com, wmi...@protonmail.com
temporary_file is a handle to a temporary file with a path.
It creates the temporary file upon construction and removes it upon
destruction.
The main use case is testing the file system on a temporary file that
simulates a block device.
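The pattern the class wraps is the classic mkstemp()-plus-unlink RAII idiom. A standalone illustration that mirrors the patch below but does not depend on Seastar headers (the class name here is ours, not the patch's):

```cpp
#include <cerrno>
#include <stdlib.h>
#include <string>
#include <system_error>
#include <unistd.h>
#include <utility>

// RAII temporary file: mkstemp() replaces the trailing "XXXXXX" with a
// unique suffix and creates the file; the destructor unlinks it.
class scoped_temp_file {
    std::string _path;
public:
    explicit scoped_temp_file(std::string path_prefix)
        : _path(std::move(path_prefix) + ".XXXXXX") {
        int fd = mkstemp(_path.data());  // mutates _path in place
        if (fd == -1) {
            throw std::system_error(errno, std::generic_category(), "mkstemp");
        }
        close(fd);  // keep only the path; the file stays on disk
    }
    ~scoped_temp_file() {
        unlink(_path.c_str());
    }
    scoped_temp_file(const scoped_temp_file&) = delete;
    scoped_temp_file& operator=(const scoped_temp_file&) = delete;
    const std::string& path() const noexcept {
        return _path;
    }
};
```

A test can then construct one, hand its path() to the filesystem under test as a fake block device, and rely on the destructor for cleanup even if the test throws.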

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
include/seastar/fs/temporary_file.hh | 54 ++++++++++++++++++++++++++++
CMakeLists.txt | 1 +
2 files changed, 55 insertions(+)
create mode 100644 include/seastar/fs/temporary_file.hh

diff --git a/include/seastar/fs/temporary_file.hh b/include/seastar/fs/temporary_file.hh
new file mode 100644
index 00000000..c00282d9
--- /dev/null
+++ b/include/seastar/fs/temporary_file.hh
@@ -0,0 +1,54 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+
+#include "seastar/core/posix.hh"
+
+#include <string>
+
+namespace seastar::fs {
+
+class temporary_file {
+ std::string _path;
+
+public:
+ explicit temporary_file(std::string path) : _path(std::move(path) + ".XXXXXX") {
+ int fd = mkstemp(_path.data());
+ throw_system_error_on(fd == -1);
+ close(fd);
+ }
+
+ ~temporary_file() {
+ unlink(_path.data());
+ }
+
+ temporary_file(const temporary_file&) = delete;
+ temporary_file& operator=(const temporary_file&) = delete;
+ temporary_file(temporary_file&&) noexcept = delete;
+ temporary_file& operator=(temporary_file&&) noexcept = delete;
+
+ const std::string& path() const noexcept {
+ return _path;
+ }
+};
+
+} // namespace seastar::fs
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 0ba7ee35..39d11ad8 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -659,6 +659,7 @@ if (Seastar_EXPERIMENTAL_FS)
# SeastarFS source files
include/seastar/fs/block_device.hh
include/seastar/fs/file.hh
+ include/seastar/fs/temporary_file.hh
src/fs/file.cc
)
endif()
--
2.26.1

Krzysztof Małysa

Apr 20, 2020, 8:02:21 AM4/20/20
to seast...@googlegroups.com, Aleksander Sorokin, sa...@scylladb.com, qup...@gmail.com, wmi...@protonmail.com
From: Aleksander Sorokin <ank...@gmail.com>

Currently the only implementation of Seastar's file abstraction is
`posix_file_impl`. This patch provides another implementation, which keeps
a reference to our file system's metadata and uses the `block_device`
handle underneath. It adds `seastarfs_file_impl`, which derives from
`file_impl` and provides a stub interface. At the moment it is extremely
oversimplified and just treats the whole block device as one huge file.
Along with it, a free function for creating this handle is provided.

Signed-off-by: Aleksander Sorokin <ank...@gmail.com>
---
include/seastar/fs/file.hh | 55 +++++++++++++++++++
src/fs/file.cc | 108 +++++++++++++++++++++++++++++++++++++
CMakeLists.txt | 2 +
3 files changed, 165 insertions(+)
create mode 100644 include/seastar/fs/file.hh
create mode 100644 src/fs/file.cc

diff --git a/include/seastar/fs/file.hh b/include/seastar/fs/file.hh
new file mode 100644
index 00000000..ae38f3a4
--- /dev/null
+++ b/include/seastar/fs/file.hh
@@ -0,0 +1,55 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include "seastar/core/file.hh"
+#include "seastar/core/future.hh"
+#include "seastar/fs/block_device.hh"
+
+namespace seastar::fs {
+
+class seastarfs_file_impl : public file_impl {
+ block_device _block_device;
+ open_flags _open_flags;
+public:
+ seastarfs_file_impl(block_device dev, open_flags flags);
+ ~seastarfs_file_impl() override = default;
+
+ future<size_t> write_dma(uint64_t pos, const void* buffer, size_t len, const io_priority_class& pc) override;
+ future<size_t> write_dma(uint64_t pos, std::vector<iovec> iov, const io_priority_class& pc) override;
+ future<size_t> read_dma(uint64_t pos, void* buffer, size_t len, const io_priority_class& pc) override;
+ future<size_t> read_dma(uint64_t pos, std::vector<iovec> iov, const io_priority_class& pc) override;
+ future<> flush() override;
+ future<struct stat> stat() override;
+ future<> truncate(uint64_t length) override;
+ future<> discard(uint64_t offset, uint64_t length) override;
+ future<> allocate(uint64_t position, uint64_t length) override;
+ future<uint64_t> size() override;
+ future<> close() noexcept override;
+ std::unique_ptr<file_handle_impl> dup() override;
+ subscription<directory_entry> list_directory(std::function<future<> (directory_entry de)> next) override;
+ future<temporary_buffer<uint8_t>> dma_read_bulk(uint64_t offset, size_t range_size, const io_priority_class& pc) override;
+};
+
+future<file> open_file_dma(std::string name, open_flags flags);
+
+}
diff --git a/src/fs/file.cc b/src/fs/file.cc
new file mode 100644
index 00000000..4f4e0ac6
--- /dev/null
+++ b/src/fs/file.cc
@@ -0,0 +1,108 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#include "seastar/core/future.hh"
+#include "seastar/fs/block_device.hh"
+#include "seastar/fs/file.hh"
+
+namespace seastar::fs {
+
+seastarfs_file_impl::seastarfs_file_impl(block_device dev, open_flags flags)
+ : _block_device(std::move(dev))
+ , _open_flags(flags) {}
+
+future<size_t>
+seastarfs_file_impl::write_dma(uint64_t pos, const void* buffer, size_t len, const io_priority_class& pc) {
+ return _block_device.write(pos, buffer, len, pc);
+}
+
+future<size_t>
+seastarfs_file_impl::write_dma(uint64_t pos, std::vector<iovec> iov, const io_priority_class& pc) {
+ throw std::bad_function_call();
+}
+
+future<size_t>
+seastarfs_file_impl::read_dma(uint64_t pos, void* buffer, size_t len, const io_priority_class& pc) {
+ return _block_device.read(pos, buffer, len, pc);
+}
+
+future<size_t>
+seastarfs_file_impl::read_dma(uint64_t pos, std::vector<iovec> iov, const io_priority_class& pc) {
+ throw std::bad_function_call();
+}
+
+future<>
+seastarfs_file_impl::flush() {
+ return _block_device.flush();
+}
+
+future<struct stat>
+seastarfs_file_impl::stat() {
+ throw std::bad_function_call();
+}
+
+future<>
+seastarfs_file_impl::truncate(uint64_t) {
+ throw std::bad_function_call();
+}
+
+future<>
+seastarfs_file_impl::discard(uint64_t offset, uint64_t length) {
+ throw std::bad_function_call();
+}
+
+future<>
+seastarfs_file_impl::allocate(uint64_t position, uint64_t length) {
+ throw std::bad_function_call();
+}
+
+future<uint64_t>
+seastarfs_file_impl::size() {
+ throw std::bad_function_call();
+}
+
+future<>
+seastarfs_file_impl::close() noexcept {
+ return _block_device.close();
+}
+
+std::unique_ptr<file_handle_impl>
+seastarfs_file_impl::dup() {
+ throw std::bad_function_call();
+}
+
+subscription<directory_entry>
+seastarfs_file_impl::list_directory(std::function<future<> (directory_entry de)> next) {
+ throw std::bad_function_call();
+}
+
+future<temporary_buffer<uint8_t>>
+seastarfs_file_impl::dma_read_bulk(uint64_t offset, size_t range_size, const io_priority_class& pc) {
+ throw std::bad_function_call();
+}
+
+future<file> open_file_dma(std::string name, open_flags flags) {
+ return open_block_device(std::move(name)).then([flags] (block_device bd) {
+ return file(make_shared<seastarfs_file_impl>(std::move(bd), flags));
+ });
+}
+
+}
diff --git a/CMakeLists.txt b/CMakeLists.txt
index b50abf99..0ba7ee35 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -658,6 +658,8 @@ if (Seastar_EXPERIMENTAL_FS)
PRIVATE
# SeastarFS source files
include/seastar/fs/block_device.hh
+ include/seastar/fs/file.hh
+ src/fs/file.cc
)
endif()

--
2.26.1

Krzysztof Małysa

Apr 20, 2020, 8:02:22 AM4/20/20
to seast...@googlegroups.com, Aleksander Sorokin, sa...@scylladb.com, qup...@gmail.com, wmi...@protonmail.com
From: Aleksander Sorokin <ank...@gmail.com>

Added the first crude unit test for seastarfs_file_impl:
parallel writing with a newly created handle and reading the same data back.

Signed-off-by: Aleksander Sorokin <ank...@gmail.com>
---
tests/unit/fs_seastarfs_test.cc | 62 +++++++++++++++++++++++++++++++++
tests/unit/CMakeLists.txt | 5 +++
2 files changed, 67 insertions(+)
create mode 100644 tests/unit/fs_seastarfs_test.cc

diff --git a/tests/unit/fs_seastarfs_test.cc b/tests/unit/fs_seastarfs_test.cc
new file mode 100644
index 00000000..25c3e8d5
--- /dev/null
+++ b/tests/unit/fs_seastarfs_test.cc
@@ -0,0 +1,62 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#include "seastar/core/aligned_buffer.hh"
+#include "seastar/core/file-types.hh"
+#include "seastar/core/file.hh"
+#include "seastar/core/thread.hh"
+#include "seastar/core/units.hh"
+#include "seastar/fs/file.hh"
+#include "seastar/fs/temporary_file.hh"
+#include "seastar/testing/thread_test_case.hh"
+
+using namespace seastar;
+using namespace fs;
+
+constexpr auto device_path = "/tmp/seastarfs";
+constexpr auto device_size = 16 * MB;
+
+SEASTAR_THREAD_TEST_CASE(parallel_read_write_test) {
+ const auto tf = temporary_file(device_path);
+ auto f = fs::open_file_dma(tf.path(), open_flags::rw).get0();
+ static auto alignment = f.memory_dma_alignment();
+
+ parallel_for_each(boost::irange<off_t>(0, device_size / alignment), [&f](auto i) {
+ auto wbuf = allocate_aligned_buffer<unsigned char>(alignment, alignment);
+ std::fill(wbuf.get(), wbuf.get() + alignment, i);
+ auto wb = wbuf.get();
+
+ return f.dma_write(i * alignment, wb, alignment).then(
+ [&f, i, wbuf = std::move(wbuf)](auto ret) mutable {
+ BOOST_REQUIRE_EQUAL(ret, alignment);
+ auto rbuf = allocate_aligned_buffer<unsigned char>(alignment, alignment);
+ auto rb = rbuf.get();
+ return f.dma_read(i * alignment, rb, alignment).then(
+ [f, rbuf = std::move(rbuf), wbuf = std::move(wbuf)](auto ret) {
+ BOOST_REQUIRE_EQUAL(ret, alignment);
+ BOOST_REQUIRE(std::equal(rbuf.get(), rbuf.get() + alignment, wbuf.get()));
+ });
+ });
+ }).wait();
+
+ f.flush().wait();
+ f.close().wait();
+}
diff --git a/tests/unit/CMakeLists.txt b/tests/unit/CMakeLists.txt
index 8f203721..f2c5187f 100644
--- a/tests/unit/CMakeLists.txt
+++ b/tests/unit/CMakeLists.txt
@@ -361,6 +361,11 @@ seastar_add_test (rpc
loopback_socket.hh
rpc_test.cc)

+if (Seastar_EXPERIMENTAL_FS)
+ seastar_add_test (fs_seastarfs
+ SOURCES fs_seastarfs_test.cc)
+endif()
+
seastar_add_test (semaphore
SOURCES semaphore_test.cc)

--
2.26.1

Krzysztof Małysa

Apr 20, 2020, 8:02:23 AM4/20/20
to seast...@googlegroups.com, Krzysztof Małysa, sa...@scylladb.com, ank...@gmail.com, qup...@gmail.com, wmi...@protonmail.com
What is tested:
- simple reads and writes
- parallel non-overlapping writes, then parallel non-overlapping reads
- random and simultaneous reads and writes

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
tests/unit/fs_block_device_test.cc | 206 +++++++++++++++++++++++++++++
tests/unit/CMakeLists.txt | 3 +
2 files changed, 209 insertions(+)
create mode 100644 tests/unit/fs_block_device_test.cc

diff --git a/tests/unit/fs_block_device_test.cc b/tests/unit/fs_block_device_test.cc
new file mode 100644
index 00000000..6887005c
--- /dev/null
+++ b/tests/unit/fs_block_device_test.cc
@@ -0,0 +1,206 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#include "seastar/core/do_with.hh"
+#include "seastar/core/future-util.hh"
+#include "seastar/core/temporary_buffer.hh"
+
+#include <boost/range/irange.hpp>
+#include <random>
+#include <seastar/core/app-template.hh>
+#include <seastar/core/thread.hh>
+#include <seastar/core/units.hh>
+#include <seastar/fs/block_device.hh>
+#include <seastar/fs/temporary_file.hh>
+#include <seastar/testing/test_runner.hh>
+#include <unistd.h>
+
+using namespace seastar;
+using namespace seastar::fs;
+
+constexpr off_t min_device_size = 16*MB;
+constexpr size_t alignment = 4*KB;
+
+static future<temporary_buffer<char>> allocate_random_aligned_buffer(size_t size) {
+ return do_with(temporary_buffer<char>::aligned(alignment, size),
+ std::default_random_engine(testing::local_random_engine()), [size](auto& buffer, auto& random_engine) {
+ return do_for_each(buffer.get_write(), buffer.get_write() + size, [&](char& c) {
+            std::uniform_int_distribution<> character(0, 255); // full byte range
+ c = character(random_engine);
+ }).then([&buffer] {
+ return std::move(buffer);
+ });
+ });
+}
+
+static future<> test_basic_read_write(const std::string& device_path) {
+ return async([&] {
+ block_device dev = open_block_device(device_path).get0();
+ constexpr size_t buff_size = 16*KB;
+ auto buffer = allocate_random_aligned_buffer(buff_size).get0();
+ auto check_buffer = allocate_random_aligned_buffer(buff_size).get0();
+
+ // Write and read
+ assert(dev.write(0, buffer.get(), buff_size).get0() == buff_size);
+ assert(dev.read(0, check_buffer.get_write(), buff_size).get0() == buff_size);
+ assert(std::memcmp(buffer.get(), check_buffer.get(), buff_size) == 0);
+
+        // Data must persist after closing and reopening
+ dev.close().get0();
+ dev = open_block_device(device_path).get0();
+        check_buffer = allocate_random_aligned_buffer(buff_size).get0(); // Refill with fresh random data so the read must overwrite it
+ assert(dev.read(0, check_buffer.get_write(), buff_size).get0() == buff_size);
+ assert(std::memcmp(buffer.get(), check_buffer.get(), buff_size) == 0);
+
+ dev.close().get0();
+ });
+}
+
+static future<> test_parallel_read_write(const std::string& device_path) {
+ return async([&] {
+ block_device dev = open_block_device(device_path).get0();
+ constexpr size_t buff_size = 16*MB;
+ auto buffer = allocate_random_aligned_buffer(buff_size).get0();
+
+ // Write
+ static_assert(buff_size % alignment == 0);
+ parallel_for_each(boost::irange<off_t>(0, buff_size / alignment), [&](off_t block_no) {
+ off_t offset = block_no * alignment;
+ return dev.write(offset, buffer.get() + offset, alignment).then([](size_t written) {
+ assert(written == alignment);
+ });
+ }).get0();
+
+ // Read
+ static_assert(buff_size % alignment == 0);
+ parallel_for_each(boost::irange<off_t>(0, buff_size / alignment), [&](off_t block_no) {
+ return async([&dev, &buffer, block_no] {
+ off_t offset = block_no * alignment;
+ auto check_buffer = allocate_random_aligned_buffer(alignment).get0();
+ assert(dev.read(offset, check_buffer.get_write(), alignment).get0() == alignment);
+ assert(std::memcmp(buffer.get() + offset, check_buffer.get(), alignment) == 0);
+ });
+ }).get0();
+
+ dev.close().get0();
+ });
+}
+
+static future<> test_simultaneous_parallel_read_and_write(const std::string& device_path) {
+ return async([&] {
+ block_device dev = open_block_device(device_path).get0();
+ constexpr size_t buff_size = 16*MB;
+ auto buffer = allocate_random_aligned_buffer(buff_size).get0();
+ assert(dev.write(0, buffer.get(), buff_size).get0() == buff_size);
+
+ static_assert(buff_size % alignment == 0);
+ size_t blocks_num = buff_size / alignment;
+ enum Kind { WRITE, READ };
+ std::vector<Kind> block_kind(blocks_num);
+ std::default_random_engine random_engine(testing::local_random_engine());
+ std::uniform_int_distribution<> choose_write(0, 1);
+ for (Kind& kind : block_kind) {
+ kind = (choose_write(random_engine) ? WRITE : READ);
+ }
+
+ // Perform simultaneous reads and writes
+ auto new_buffer = allocate_random_aligned_buffer(buff_size).get0();
+ auto write_fut = parallel_for_each(boost::irange<off_t>(0, blocks_num), [&](off_t block_no) {
+ if (block_kind[block_no] != WRITE) {
+ return now();
+ }
+
+ off_t offset = block_no * alignment;
+ return dev.write(offset, new_buffer.get() + offset, alignment).then([](size_t written) {
+ assert(written == alignment);
+ });
+ });
+ auto read_fut = parallel_for_each(boost::irange<off_t>(0, blocks_num), [&](off_t block_no) {
+ if (block_kind[block_no] != READ) {
+ return now();
+ }
+
+ return async([&dev, &buffer, block_no] {
+ off_t offset = block_no * alignment;
+ auto check_buffer = allocate_random_aligned_buffer(alignment).get0();
+ assert(dev.read(offset, check_buffer.get_write(), alignment).get0() == alignment);
+ assert(std::memcmp(buffer.get() + offset, check_buffer.get(), alignment) == 0);
+ });
+ });
+
+ when_all_succeed(std::move(write_fut), std::move(read_fut)).get0();
+
+ // Check that writes were made in the correct places
+ parallel_for_each(boost::irange<off_t>(0, blocks_num), [&](off_t block_no) {
+ return async([&dev, &buffer, &new_buffer, &block_kind, block_no] {
+ off_t offset = block_no * alignment;
+ auto check_buffer = allocate_random_aligned_buffer(alignment).get0();
+ assert(dev.read(offset, check_buffer.get_write(), alignment).get0() == alignment);
+ auto& orig_buff = (block_kind[block_no] == WRITE ? new_buffer : buffer);
+ assert(std::memcmp(orig_buff.get() + offset, check_buffer.get(), alignment) == 0);
+ });
+ }).get0();
+
+ dev.close().get0();
+ });
+}
+
+static future<> prepare_file(const std::string& file_path) {
+ return async([&] {
+        // Create the device file if it does not exist
+ file dev = open_file_dma(file_path, open_flags::rw | open_flags::create).get0();
+
+ auto st = dev.stat().get0();
+ if (S_ISREG(st.st_mode) and st.st_size < min_device_size) {
+ dev.truncate(min_device_size).get0();
+ }
+
+ dev.close().get0();
+ });
+}
+
+int main(int argc, char** argv) {
+ app_template app;
+ app.add_options()
+ ("help", "produce this help message")
+ ("dev", boost::program_options::value<std::string>(),
+ "optional path to device file to test block_device on");
+ return app.run(argc, argv, [&app] {
+ return async([&] {
+ auto& args = app.configuration();
+ std::optional<temporary_file> tmp_device_file;
+ std::string device_path = [&]() -> std::string {
+ if (args.count("dev")) {
+ return args["dev"].as<std::string>();
+ }
+
+ tmp_device_file.emplace("/tmp/block_device_test_file");
+ return tmp_device_file->path();
+ }();
+
+ assert(not device_path.empty());
+ prepare_file(device_path).get0();
+ test_basic_read_write(device_path).get0();
+ test_parallel_read_write(device_path).get0();
+ test_simultaneous_parallel_read_and_write(device_path).get0();
+ });
+ });
+}
diff --git a/tests/unit/CMakeLists.txt b/tests/unit/CMakeLists.txt
index f2c5187f..21e564fb 100644
--- a/tests/unit/CMakeLists.txt
+++ b/tests/unit/CMakeLists.txt
@@ -362,6 +362,9 @@ seastar_add_test (rpc
rpc_test.cc)

if (Seastar_EXPERIMENTAL_FS)
+ seastar_add_app_test (fs_block_device
+ SOURCES fs_block_device_test.cc
+ LIBRARIES seastar_testing)
seastar_add_test (fs_seastarfs
SOURCES fs_seastarfs_test.cc)
endif()
--
2.26.1

Krzysztof Małysa

Apr 20, 2020, 8:02:24 AM
to seast...@googlegroups.com, Krzysztof Małysa, sa...@scylladb.com, ank...@gmail.com, qup...@gmail.com, wmi...@protonmail.com
- units.hh: basic units
- cluster.hh: cluster_id unit and operations on it (converting cluster
ids to offsets)
- inode.hh: inode unit and operations on it (extracting shard_no from
inode and allocating new inode)
- bitwise.hh: bitwise operations
- range.hh: range abstraction

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
src/fs/bitwise.hh | 125 ++++++++++++++++++++++++++++++++++++++++++++++
src/fs/cluster.hh | 42 ++++++++++++++++
src/fs/inode.hh | 80 +++++++++++++++++++++++++++++
src/fs/range.hh | 61 ++++++++++++++++++++++
src/fs/units.hh | 40 +++++++++++++++
CMakeLists.txt | 5 ++
6 files changed, 353 insertions(+)
create mode 100644 src/fs/bitwise.hh
create mode 100644 src/fs/cluster.hh
create mode 100644 src/fs/inode.hh
create mode 100644 src/fs/range.hh
create mode 100644 src/fs/units.hh

diff --git a/src/fs/bitwise.hh b/src/fs/bitwise.hh
new file mode 100644
index 00000000..e53c1919
--- /dev/null
+++ b/src/fs/bitwise.hh
@@ -0,0 +1,125 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include <cassert>
+#include <type_traits>
+
+namespace seastar::fs {
+
+template<class T, std::enable_if_t<std::is_unsigned_v<T>, int> = 0>
+constexpr inline bool is_power_of_2(T x) noexcept {
+ return (x > 0 and (x & (x - 1)) == 0);
+}
+
+static_assert(not is_power_of_2(0u));
+static_assert(is_power_of_2(1u));
+static_assert(is_power_of_2(2u));
+static_assert(not is_power_of_2(3u));
+static_assert(is_power_of_2(4u));
+static_assert(not is_power_of_2(5u));
+static_assert(not is_power_of_2(6u));
+static_assert(not is_power_of_2(7u));
+static_assert(is_power_of_2(8u));
+
+template<class T, class U, std::enable_if_t<std::is_unsigned_v<T>, int> = 0, std::enable_if_t<std::is_unsigned_v<U>, int> = 0>
+constexpr inline T div_by_power_of_2(T a, U b) noexcept {
+ assert(is_power_of_2(b));
+ return (a >> __builtin_ctzll(b)); // should be 2 CPU cycles after inlining on modern x86_64
+}
+
+static_assert(div_by_power_of_2(13u, 1u) == 13);
+static_assert(div_by_power_of_2(12u, 4u) == 3);
+static_assert(div_by_power_of_2(42u, 32u) == 1);
+
+template<class T, class U, std::enable_if_t<std::is_unsigned_v<T>, int> = 0, std::enable_if_t<std::is_unsigned_v<U>, int> = 0>
+constexpr inline T mod_by_power_of_2(T a, U b) noexcept {
+ assert(is_power_of_2(b));
+ return (a & (b - 1));
+}
+
+static_assert(mod_by_power_of_2(13u, 1u) == 0);
+static_assert(mod_by_power_of_2(42u, 32u) == 10);
+
+template<class T, class U, std::enable_if_t<std::is_unsigned_v<T>, int> = 0, std::enable_if_t<std::is_unsigned_v<U>, int> = 0>
+constexpr inline T mul_by_power_of_2(T a, U b) noexcept {
+ assert(is_power_of_2(b));
+ return (a << __builtin_ctzll(b)); // should be 2 CPU cycles after inlining on modern x86_64
+}
+
+static_assert(mul_by_power_of_2(3u, 1u) == 3);
+static_assert(mul_by_power_of_2(3u, 4u) == 12);
+
+template<class T, class U, std::enable_if_t<std::is_unsigned_v<T>, int> = 0, std::enable_if_t<std::is_unsigned_v<U>, int> = 0>
+constexpr inline T round_down_to_multiple_of_power_of_2(T a, U b) noexcept {
+ return a - mod_by_power_of_2(a, b);
+}
+
+static_assert(round_down_to_multiple_of_power_of_2(0u, 1u) == 0);
+static_assert(round_down_to_multiple_of_power_of_2(1u, 1u) == 1);
+static_assert(round_down_to_multiple_of_power_of_2(19u, 1u) == 19);
+
+static_assert(round_down_to_multiple_of_power_of_2(0u, 2u) == 0);
+static_assert(round_down_to_multiple_of_power_of_2(1u, 2u) == 0);
+static_assert(round_down_to_multiple_of_power_of_2(2u, 2u) == 2);
+static_assert(round_down_to_multiple_of_power_of_2(3u, 2u) == 2);
+static_assert(round_down_to_multiple_of_power_of_2(4u, 2u) == 4);
+static_assert(round_down_to_multiple_of_power_of_2(5u, 2u) == 4);
+
+static_assert(round_down_to_multiple_of_power_of_2(31u, 16u) == 16);
+static_assert(round_down_to_multiple_of_power_of_2(32u, 16u) == 32);
+static_assert(round_down_to_multiple_of_power_of_2(33u, 16u) == 32);
+static_assert(round_down_to_multiple_of_power_of_2(37u, 16u) == 32);
+static_assert(round_down_to_multiple_of_power_of_2(39u, 16u) == 32);
+static_assert(round_down_to_multiple_of_power_of_2(45u, 16u) == 32);
+static_assert(round_down_to_multiple_of_power_of_2(47u, 16u) == 32);
+static_assert(round_down_to_multiple_of_power_of_2(48u, 16u) == 48);
+static_assert(round_down_to_multiple_of_power_of_2(49u, 16u) == 48);
+
+template<class T, class U, std::enable_if_t<std::is_unsigned_v<T>, int> = 0, std::enable_if_t<std::is_unsigned_v<U>, int> = 0>
+constexpr inline T round_up_to_multiple_of_power_of_2(T a, U b) noexcept {
+ auto mod = mod_by_power_of_2(a, b);
+ return (mod == 0 ? a : a - mod + b);
+}
+
+static_assert(round_up_to_multiple_of_power_of_2(0u, 1u) == 0);
+static_assert(round_up_to_multiple_of_power_of_2(1u, 1u) == 1);
+static_assert(round_up_to_multiple_of_power_of_2(19u, 1u) == 19);
+
+static_assert(round_up_to_multiple_of_power_of_2(0u, 2u) == 0);
+static_assert(round_up_to_multiple_of_power_of_2(1u, 2u) == 2);
+static_assert(round_up_to_multiple_of_power_of_2(2u, 2u) == 2);
+static_assert(round_up_to_multiple_of_power_of_2(3u, 2u) == 4);
+static_assert(round_up_to_multiple_of_power_of_2(4u, 2u) == 4);
+static_assert(round_up_to_multiple_of_power_of_2(5u, 2u) == 6);
+
+static_assert(round_up_to_multiple_of_power_of_2(31u, 16u) == 32);
+static_assert(round_up_to_multiple_of_power_of_2(32u, 16u) == 32);
+static_assert(round_up_to_multiple_of_power_of_2(33u, 16u) == 48);
+static_assert(round_up_to_multiple_of_power_of_2(37u, 16u) == 48);
+static_assert(round_up_to_multiple_of_power_of_2(39u, 16u) == 48);
+static_assert(round_up_to_multiple_of_power_of_2(45u, 16u) == 48);
+static_assert(round_up_to_multiple_of_power_of_2(47u, 16u) == 48);
+static_assert(round_up_to_multiple_of_power_of_2(48u, 16u) == 48);
+static_assert(round_up_to_multiple_of_power_of_2(49u, 16u) == 64);
+
+} // namespace seastar::fs
diff --git a/src/fs/cluster.hh b/src/fs/cluster.hh
new file mode 100644
index 00000000..a35ce323
--- /dev/null
+++ b/src/fs/cluster.hh
@@ -0,0 +1,42 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/bitwise.hh"
+#include "fs/units.hh"
+
+namespace seastar::fs {
+
+using cluster_id_t = uint64_t;
+using cluster_range = range<cluster_id_t>;
+
+inline cluster_id_t offset_to_cluster_id(disk_offset_t offset, unit_size_t cluster_size) noexcept {
+ assert(is_power_of_2(cluster_size));
+ return div_by_power_of_2(offset, cluster_size);
+}
+
+inline disk_offset_t cluster_id_to_offset(cluster_id_t cluster_id, unit_size_t cluster_size) noexcept {
+ assert(is_power_of_2(cluster_size));
+ return mul_by_power_of_2(cluster_id, cluster_size);
+}
+
+} // namespace seastar::fs
diff --git a/src/fs/inode.hh b/src/fs/inode.hh
new file mode 100644
index 00000000..aabc8d00
--- /dev/null
+++ b/src/fs/inode.hh
@@ -0,0 +1,80 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/bitwise.hh"
+#include "fs/units.hh"
+
+#include <cstdint>
+#include <optional>
+
+namespace seastar::fs {
+
+// The last log2(fs_shards_pool_size) bits of the inode number contain the id of the shard that owns the inode
+using inode_t = uint64_t;
+
+// Obtains shard id of the shard owning @p inode.
+// @p fs_shards_pool_size is the number of file system shards rounded up to a power of 2
+inline fs_shard_id_t inode_to_shard_no(inode_t inode, fs_shard_id_t fs_shards_pool_size) noexcept {
+ assert(is_power_of_2(fs_shards_pool_size));
+ return mod_by_power_of_2(inode, fs_shards_pool_size);
+}
+
+// Returns inode belonging to the shard owning @p shard_previous_inode that is next after @p shard_previous_inode
+// (i.e. the lowest inode greater than @p shard_previous_inode belonging to the same shard)
+// @p fs_shards_pool_size is the number of file system shards rounded up to a power of 2
+inline inode_t shard_next_inode(inode_t shard_previous_inode, fs_shard_id_t fs_shards_pool_size) noexcept {
+ return shard_previous_inode + fs_shards_pool_size;
+}
+
+// Returns first inode (lowest by value) belonging to the shard @p fs_shard_id
+inline inode_t shard_first_inode(fs_shard_id_t fs_shard_id) noexcept {
+ return fs_shard_id;
+}
+
+class shard_inode_allocator {
+ fs_shard_id_t _fs_shards_pool_size;
+ fs_shard_id_t _fs_shard_id;
+ std::optional<inode_t> _latest_allocated_inode;
+
+public:
+ shard_inode_allocator(fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id, std::optional<inode_t> latest_allocated_inode = std::nullopt)
+ : _fs_shards_pool_size(fs_shards_pool_size)
+ , _fs_shard_id(fs_shard_id)
+ , _latest_allocated_inode(latest_allocated_inode) {}
+
+ inode_t alloc() noexcept {
+ if (not _latest_allocated_inode) {
+ _latest_allocated_inode = shard_first_inode(_fs_shard_id);
+ return *_latest_allocated_inode;
+ }
+
+ _latest_allocated_inode = shard_next_inode(*_latest_allocated_inode, _fs_shards_pool_size);
+ return *_latest_allocated_inode;
+ }
+
+ void reset(std::optional<inode_t> latest_allocated_inode = std::nullopt) noexcept {
+ _latest_allocated_inode = latest_allocated_inode;
+ }
+};
+
+} // namespace seastar::fs
diff --git a/src/fs/range.hh b/src/fs/range.hh
new file mode 100644
index 00000000..ef0c6756
--- /dev/null
+++ b/src/fs/range.hh
@@ -0,0 +1,61 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include <algorithm>
+
+namespace seastar::fs {
+
+template <class T>
+struct range {
+ T beg;
+ T end; // exclusive
+
+ constexpr bool is_empty() const noexcept { return beg >= end; }
+
+ constexpr T size() const noexcept { return end - beg; }
+};
+
+template <class T>
+range(T beg, T end) -> range<T>;
+
+template <class T>
+inline bool operator==(range<T> a, range<T> b) noexcept {
+ return (a.beg == b.beg and a.end == b.end);
+}
+
+template <class T>
+inline bool operator!=(range<T> a, range<T> b) noexcept {
+ return not (a == b);
+}
+
+template <class T>
+inline range<T> intersection(range<T> a, range<T> b) noexcept {
+ return {std::max(a.beg, b.beg), std::min(a.end, b.end)};
+}
+
+template <class T>
+inline bool are_intersecting(range<T> a, range<T> b) noexcept {
+ return (std::max(a.beg, b.beg) < std::min(a.end, b.end));
+}
+
+} // namespace seastar::fs
diff --git a/src/fs/units.hh b/src/fs/units.hh
new file mode 100644
index 00000000..1fc6754b
--- /dev/null
+++ b/src/fs/units.hh
@@ -0,0 +1,40 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/range.hh"
+
+#include <cstdint>
+
+namespace seastar::fs {
+
+using unit_size_t = uint32_t;
+
+using disk_offset_t = uint64_t;
+using disk_range = range<disk_offset_t>;
+
+using file_offset_t = uint64_t;
+using file_range = range<file_offset_t>;
+
+using fs_shard_id_t = uint32_t;
+
+} // namespace seastar::fs
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 39d11ad8..8ad08c7a 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -660,7 +660,12 @@ if (Seastar_EXPERIMENTAL_FS)
include/seastar/fs/block_device.hh
include/seastar/fs/file.hh
include/seastar/fs/temporary_file.hh
+ src/fs/bitwise.hh
+ src/fs/cluster.hh
src/fs/file.cc
+ src/fs/inode.hh
+ src/fs/range.hh
+ src/fs/units.hh
)
endif()

--
2.26.1

Krzysztof Małysa

Apr 20, 2020, 8:02:25 AM
to seast...@googlegroups.com, Wojciech Mitros, sa...@scylladb.com, ank...@gmail.com, qup...@gmail.com
From: Wojciech Mitros <wmi...@protonmail.com>

Disk space is divided into fixed-size segments called clusters. Each shard of
the filesystem will be assigned a set of clusters. The cluster allocator
allocates and frees clusters from that set.

Signed-off-by: Wojciech Mitros <wmi...@protonmail.com>
---
src/fs/cluster_allocator.hh | 50 ++++++++++++++++++++++++++++++++++
src/fs/cluster_allocator.cc | 54 +++++++++++++++++++++++++++++++++++++
CMakeLists.txt | 2 ++
3 files changed, 106 insertions(+)
create mode 100644 src/fs/cluster_allocator.hh
create mode 100644 src/fs/cluster_allocator.cc

diff --git a/src/fs/cluster_allocator.hh b/src/fs/cluster_allocator.hh
new file mode 100644
index 00000000..ef4f30b9
--- /dev/null
+++ b/src/fs/cluster_allocator.hh
@@ -0,0 +1,50 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/cluster.hh"
+
+#include <deque>
+#include <optional>
+#include <unordered_set>
+
+namespace seastar {
+
+namespace fs {
+
+class cluster_allocator {
+ std::unordered_set<cluster_id_t> _allocated_clusters;
+ std::deque<cluster_id_t> _free_clusters;
+
+public:
+ explicit cluster_allocator(std::unordered_set<cluster_id_t> allocated_clusters, std::deque<cluster_id_t> free_clusters);
+
+ // Tries to allocate a cluster
+ std::optional<cluster_id_t> alloc();
+
+ // @p cluster_id has to be allocated using alloc()
+ void free(cluster_id_t cluster_id);
+};
+
+} // namespace fs
+
+} // namespace seastar
diff --git a/src/fs/cluster_allocator.cc b/src/fs/cluster_allocator.cc
new file mode 100644
index 00000000..c436c7ba
--- /dev/null
+++ b/src/fs/cluster_allocator.cc
@@ -0,0 +1,54 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#include "fs/cluster.hh"
+#include "fs/cluster_allocator.hh"
+
+#include <cassert>
+#include <optional>
+
+namespace seastar {
+
+namespace fs {
+
+cluster_allocator::cluster_allocator(std::unordered_set<cluster_id_t> allocated_clusters, std::deque<cluster_id_t> free_clusters)
+ : _allocated_clusters(std::move(allocated_clusters)), _free_clusters(std::move(free_clusters)) {}
+
+std::optional<cluster_id_t> cluster_allocator::alloc() {
+ if (_free_clusters.empty()) {
+ return std::nullopt;
+ }
+
+ cluster_id_t cluster_id = _free_clusters.front();
+ _free_clusters.pop_front();
+ _allocated_clusters.insert(cluster_id);
+ return cluster_id;
+}
+
+void cluster_allocator::free(cluster_id_t cluster_id) {
+ assert(_allocated_clusters.count(cluster_id) == 1);
+ _free_clusters.emplace_back(cluster_id);
+ _allocated_clusters.erase(cluster_id);
+}
+
+} // namespace fs
+
+} // namespace seastar
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 8ad08c7a..891201a3 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -662,6 +662,8 @@ if (Seastar_EXPERIMENTAL_FS)
include/seastar/fs/temporary_file.hh
src/fs/bitwise.hh
src/fs/cluster.hh
+ src/fs/cluster_allocator.cc
+ src/fs/cluster_allocator.hh
src/fs/file.cc
src/fs/inode.hh
src/fs/range.hh
--
2.26.1

Krzysztof Małysa

Apr 20, 2020, 8:02:26 AM
to seast...@googlegroups.com, Wojciech Mitros, sa...@scylladb.com, ank...@gmail.com, qup...@gmail.com
From: Wojciech Mitros <wmi...@protonmail.com>

Add tests checking that the cluster allocator works correctly in both ordinary
and corner cases (e.g. trying to allocate when no clusters are free).

Signed-off-by: Wojciech Mitros <wmi...@protonmail.com>
---
tests/unit/fs_cluster_allocator_test.cc | 115 ++++++++++++++++++++++++
tests/unit/CMakeLists.txt | 3 +
2 files changed, 118 insertions(+)
create mode 100644 tests/unit/fs_cluster_allocator_test.cc

diff --git a/tests/unit/fs_cluster_allocator_test.cc b/tests/unit/fs_cluster_allocator_test.cc
new file mode 100644
index 00000000..3650254e
--- /dev/null
+++ b/tests/unit/fs_cluster_allocator_test.cc
@@ -0,0 +1,115 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#define BOOST_TEST_MODULE fs
+
+#include "fs/cluster_allocator.hh"
+
+#include <boost/test/included/unit_test.hpp>
+#include <deque>
+#include <seastar/core/units.hh>
+#include <unordered_set>
+
+using namespace seastar;
+
+BOOST_AUTO_TEST_CASE(test_cluster_0) {
+ fs::cluster_allocator ca({}, {0});
+ BOOST_REQUIRE_EQUAL(ca.alloc().value(), 0);
+ BOOST_REQUIRE(ca.alloc() == std::nullopt);
+ BOOST_REQUIRE(ca.alloc() == std::nullopt);
+ ca.free(0);
+ BOOST_REQUIRE_EQUAL(ca.alloc().value(), 0);
+ BOOST_REQUIRE(ca.alloc() == std::nullopt);
+ BOOST_REQUIRE(ca.alloc() == std::nullopt);
+}
+
+BOOST_AUTO_TEST_CASE(test_empty) {
+ fs::cluster_allocator empty_ca{{}, {}};
+ BOOST_REQUIRE(empty_ca.alloc() == std::nullopt);
+}
+
+BOOST_AUTO_TEST_CASE(test_small) {
+ std::deque<fs::cluster_id_t> deq{1, 5, 3, 4, 2};
+ fs::cluster_allocator small_ca({}, deq);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[0]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[1]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[2]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[3]);
+
+ small_ca.free(deq[2]);
+ small_ca.free(deq[1]);
+ small_ca.free(deq[3]);
+ small_ca.free(deq[0]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[4]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[2]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[1]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[3]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[0]);
+ BOOST_REQUIRE(small_ca.alloc() == std::nullopt);
+
+ small_ca.free(deq[2]);
+ small_ca.free(deq[4]);
+ small_ca.free(deq[3]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[2]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[4]);
+ small_ca.free(deq[2]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[3]);
+ small_ca.free(deq[4]);
+ BOOST_REQUIRE_EQUAL(small_ca.alloc().value(), deq[2]);
+}
+
+BOOST_AUTO_TEST_CASE(test_max) {
+ constexpr fs::cluster_id_t clusters_per_shard = 1024;
+ std::deque<fs::cluster_id_t> deq;
+ for (fs::cluster_id_t i = 0; i < clusters_per_shard; i++) {
+ deq.emplace_back(i);
+ }
+ fs::cluster_allocator ordinary_ca({}, deq);
+ for (fs::cluster_id_t i = 0; i < clusters_per_shard; i++) {
+ BOOST_REQUIRE_EQUAL(ordinary_ca.alloc().value(), i);
+ }
+ BOOST_REQUIRE(ordinary_ca.alloc() == std::nullopt);
+ for (fs::cluster_id_t i = 0; i < clusters_per_shard; i++) {
+ ordinary_ca.free(i);
+ }
+}
+
+BOOST_AUTO_TEST_CASE(test_pseudo_rand) {
+ std::unordered_set<fs::cluster_id_t> uset;
+ std::deque<fs::cluster_id_t> deq;
+ fs::cluster_id_t elem = 215;
+ while (elem != 806) {
+ deq.emplace_back(elem);
+ elem = (elem * 215) % 1021;
+ }
+ elem = 1;
+ while (elem != 1020) {
+ uset.insert(elem);
+ elem = (elem * 19) % 1021;
+ }
+ fs::cluster_allocator random_ca(uset, deq);
+ elem = 215;
+ while (elem != 1) {
+ BOOST_REQUIRE_EQUAL(random_ca.alloc().value(), elem);
+ random_ca.free(1021-elem);
+ elem = (elem * 215) % 1021;
+ }
+}
diff --git a/tests/unit/CMakeLists.txt b/tests/unit/CMakeLists.txt
index 21e564fb..b2669e0a 100644
--- a/tests/unit/CMakeLists.txt
+++ b/tests/unit/CMakeLists.txt
@@ -365,6 +365,9 @@ if (Seastar_EXPERIMENTAL_FS)
seastar_add_app_test (fs_block_device
SOURCES fs_block_device_test.cc
LIBRARIES seastar_testing)
+ seastar_add_test (fs_cluster_allocator
+ KIND BOOST
+ SOURCES fs_cluster_allocator_test.cc)

Krzysztof Małysa
Apr 20, 2020, 8:02:27 AM
to seast...@googlegroups.com, Michał Niciejewski, sa...@scylladb.com, ank...@gmail.com, wmi...@protonmail.com
From: Michał Niciejewski <qup...@gmail.com>

Corner case tests:
- simple valid tests for reading and writing the bootstrap record
- valid and invalid numbers of shards (the range
[1, bootstrap_record::max_shards_nb] is valid)
- invalid CRC in the read record
- invalid magic number in the read record
- invalid information about filesystem shards:
* id of the first metadata log cluster not in the available cluster range
* invalid cluster range
* overlapping available cluster ranges for two different shards
* invalid alignment
* invalid cluster size

Signed-off-by: Michał Niciejewski <qup...@gmail.com>
---
tests/unit/fs_mock_block_device.hh | 55 ++++
tests/unit/fs_bootstrap_record_test.cc | 414 +++++++++++++++++++++++++
tests/unit/fs_mock_block_device.cc | 50 +++
tests/unit/CMakeLists.txt | 4 +
4 files changed, 523 insertions(+)
create mode 100644 tests/unit/fs_mock_block_device.hh
create mode 100644 tests/unit/fs_bootstrap_record_test.cc
create mode 100644 tests/unit/fs_mock_block_device.cc

diff --git a/tests/unit/fs_mock_block_device.hh b/tests/unit/fs_mock_block_device.hh
new file mode 100644
index 00000000..08da1491
--- /dev/null
+++ b/tests/unit/fs_mock_block_device.hh
@@ -0,0 +1,55 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB Ltd.
+ */
+
+#pragma once
+
+#include <cstring>
+#include <seastar/fs/block_device.hh>
+
+namespace seastar::fs {
+
+class mock_block_device_impl : public block_device_impl {
+public:
+ using buf_type = basic_sstring<uint8_t, size_t, 32, false>;
+ buf_type buf;
+ ~mock_block_device_impl() override = default;
+
+ struct write_operation {
+ uint64_t disk_offset;
+ temporary_buffer<uint8_t> data;
+ };
+
+ std::vector<write_operation> writes;
+
+ future<size_t> write(uint64_t pos, const void* buffer, size_t len, const io_priority_class&) override;
+
+ future<size_t> read(uint64_t pos, void* buffer, size_t len, const io_priority_class&) noexcept override;
+
+ future<> flush() noexcept override {
+ return make_ready_future<>();
+ }
+
+ future<> close() noexcept override {
+ return make_ready_future<>();
+ }
+};
+
+} // seastar::fs
diff --git a/tests/unit/fs_bootstrap_record_test.cc b/tests/unit/fs_bootstrap_record_test.cc
new file mode 100644
index 00000000..9994f5ee
--- /dev/null
+++ b/tests/unit/fs_bootstrap_record_test.cc
@@ -0,0 +1,414 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#include "fs/bootstrap_record.hh"
+#include "fs/cluster.hh"
+#include "fs/crc.hh"
+#include "fs_mock_block_device.hh"
+
+#include <boost/crc.hpp>
+#include <cstring>
+#include <seastar/core/print.hh>
+#include <seastar/fs/block_device.hh>
+#include <seastar/testing/test_case.hh>
+#include <seastar/testing/test_runner.hh>
+#include <seastar/testing/thread_test_case.hh>
+
+using namespace seastar;
+using namespace seastar::fs;
+
+namespace {
+
+inline std::vector<bootstrap_record::shard_info> prepare_valid_shards_info(uint32_t size) {
+ std::vector<bootstrap_record::shard_info> ret(size);
+ cluster_id_t curr = 1;
+ for (bootstrap_record::shard_info& info : ret) {
+ info.available_clusters = {curr, curr + 1};
+ info.metadata_cluster = curr;
+ curr++;
+ }
+ return ret;
+};
+
+inline void repair_crc32(shared_ptr<mock_block_device_impl> dev_impl) noexcept {
+ mock_block_device_impl::buf_type& buff = dev_impl.get()->buf;
+ constexpr size_t crc_pos = offsetof(bootstrap_record_disk, crc);
+ const uint32_t crc_new = crc32(buff.data(), crc_pos);
+ std::memcpy(buff.data() + crc_pos, &crc_new, sizeof(crc_new));
+}
+
+inline void change_byte_at_offset(shared_ptr<mock_block_device_impl> dev_impl, size_t offset) noexcept {
+ dev_impl.get()->buf[offset] ^= 1;
+}
+
+template<typename T>
+inline void place_at_offset(shared_ptr<mock_block_device_impl> dev_impl, size_t offset, T value) noexcept {
+ std::memcpy(dev_impl.get()->buf.data() + offset, &value, sizeof(value));
+}
+
+template<>
+inline void place_at_offset(shared_ptr<mock_block_device_impl> dev_impl, size_t offset,
+ std::vector<bootstrap_record::shard_info> shards_info) noexcept {
+ bootstrap_record::shard_info shards_info_disk[bootstrap_record::max_shards_nb];
+ std::memset(shards_info_disk, 0, sizeof(shards_info_disk));
+ std::copy(shards_info.begin(), shards_info.end(), shards_info_disk);
+
+ std::memcpy(dev_impl.get()->buf.data() + offset, shards_info_disk, sizeof(shards_info_disk));
+}
+
+inline bool check_exception_message(const invalid_bootstrap_record& ex, const sstring& message) {
+ return sstring(ex.what()).find(message) != sstring::npos;
+}
+
+const bootstrap_record default_write_record(1, bootstrap_record::min_alignment * 4,
+ bootstrap_record::min_alignment * 8, 1, {{6, {6, 9}}, {9, {9, 12}}, {12, {12, 15}}});
+
+}
+
+
+
+BOOST_TEST_DONT_PRINT_LOG_VALUE(bootstrap_record)
+
+SEASTAR_THREAD_TEST_CASE(valid_basic_test) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ const bootstrap_record write_record = default_write_record;
+
+ write_record.write_to_disk(dev).get();
+ const bootstrap_record read_record = bootstrap_record::read_from_disk(dev).get0();
+ BOOST_REQUIRE_EQUAL(write_record, read_record);
+}
+
+SEASTAR_THREAD_TEST_CASE(valid_max_shards_nb_test) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ bootstrap_record write_record = default_write_record;
+ write_record.shards_info = prepare_valid_shards_info(bootstrap_record::max_shards_nb);
+
+ write_record.write_to_disk(dev).get();
+ const bootstrap_record read_record = bootstrap_record::read_from_disk(dev).get0();
+ BOOST_REQUIRE_EQUAL(write_record, read_record);
+}
+
+SEASTAR_THREAD_TEST_CASE(valid_one_shard_test) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ bootstrap_record write_record = default_write_record;
+ write_record.shards_info = prepare_valid_shards_info(1);
+
+ write_record.write_to_disk(dev).get();
+ const bootstrap_record read_record = bootstrap_record::read_from_disk(dev).get0();
+ BOOST_REQUIRE_EQUAL(write_record, read_record);
+}
+
+
+
+SEASTAR_THREAD_TEST_CASE(invalid_crc_read) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ const bootstrap_record write_record = default_write_record;
+
+ constexpr size_t crc_offset = offsetof(bootstrap_record_disk, crc);
+
+ write_record.write_to_disk(dev).get();
+ change_byte_at_offset(dev_impl, crc_offset);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Invalid CRC");
+ });
+}
+
+SEASTAR_THREAD_TEST_CASE(invalid_magic_read) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ const bootstrap_record write_record = default_write_record;
+
+ constexpr size_t magic_offset = offsetof(bootstrap_record_disk, magic);
+
+ write_record.write_to_disk(dev).get();
+ change_byte_at_offset(dev_impl, magic_offset);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Invalid magic number");
+ });
+}
+
+SEASTAR_THREAD_TEST_CASE(invalid_shards_info_read) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ const bootstrap_record write_record = default_write_record;
+
+ constexpr size_t shards_nb_offset = offsetof(bootstrap_record_disk, shards_nb);
+ constexpr size_t shards_info_offset = offsetof(bootstrap_record_disk, shards_info);
+
+ // shards_nb > max_shards_nb
+ write_record.write_to_disk(dev).get();
+ place_at_offset(dev_impl, shards_nb_offset, bootstrap_record::max_shards_nb + 1);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, fmt::format("Shards number should be smaller or equal to {}", bootstrap_record::max_shards_nb));
+ });
+
+ // shards_nb == 0
+ write_record.write_to_disk(dev).get();
+ place_at_offset(dev_impl, shards_nb_offset, 0);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Shards number should be greater than 0");
+ });
+
+ std::vector<bootstrap_record::shard_info> shards_info;
+
+ // metadata_cluster not in available_clusters range
+ write_record.write_to_disk(dev).get();
+ shards_info = {{1, {2, 3}}};
+ place_at_offset(dev_impl, shards_nb_offset, shards_info.size());
+ place_at_offset(dev_impl, shards_info_offset, shards_info);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster with metadata should be inside available cluster range");
+ });
+
+ write_record.write_to_disk(dev).get();
+ shards_info = {{3, {2, 3}}};
+ place_at_offset(dev_impl, shards_nb_offset, shards_info.size());
+ place_at_offset(dev_impl, shards_info_offset, shards_info);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster with metadata should be inside available cluster range");
+ });
+
+ // available_clusters.beg > available_clusters.end
+ write_record.write_to_disk(dev).get();
+ shards_info = {{3, {4, 2}}};
+ place_at_offset(dev_impl, shards_nb_offset, shards_info.size());
+ place_at_offset(dev_impl, shards_info_offset, shards_info);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Invalid cluster range");
+ });
+
+ // available_clusters.beg == available_clusters.end
+ write_record.write_to_disk(dev).get();
+ shards_info = {{2, {2, 2}}};
+ place_at_offset(dev_impl, shards_nb_offset, shards_info.size());
+ place_at_offset(dev_impl, shards_info_offset, shards_info);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Invalid cluster range");
+ });
+
+ // available_clusters contains cluster 0
+ write_record.write_to_disk(dev).get();
+ shards_info = {{1, {0, 5}}};
+ place_at_offset(dev_impl, shards_nb_offset, shards_info.size());
+ place_at_offset(dev_impl, shards_info_offset, shards_info);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Range of available clusters should not contain cluster 0");
+ });
+
+ // available_clusters overlap
+ write_record.write_to_disk(dev).get();
+ shards_info = {{1, {1, 3}}, {2, {2, 4}}};
+ place_at_offset(dev_impl, shards_nb_offset, shards_info.size());
+ place_at_offset(dev_impl, shards_info_offset, shards_info);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster ranges should not overlap");
+ });
+}
+
+SEASTAR_THREAD_TEST_CASE(invalid_alignment_read) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ const bootstrap_record write_record = default_write_record;
+
+ constexpr size_t alignment_offset = offsetof(bootstrap_record_disk, alignment);
+
+ // alignment not power of 2
+ write_record.write_to_disk(dev).get();
+ place_at_offset(dev_impl, alignment_offset, bootstrap_record::min_alignment + 1);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Alignment should be a power of 2");
+ });
+
+ // alignment smaller than 512
+ write_record.write_to_disk(dev).get();
+ place_at_offset(dev_impl, alignment_offset, bootstrap_record::min_alignment / 2);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, fmt::format("Alignment should be greater or equal to {}", bootstrap_record::min_alignment));
+ });
+}
+
+SEASTAR_THREAD_TEST_CASE(invalid_cluster_size_read) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ const bootstrap_record write_record = default_write_record;
+
+ constexpr size_t cluster_size_offset = offsetof(bootstrap_record_disk, cluster_size);
+
+ // cluster_size not divisible by alignment
+ write_record.write_to_disk(dev).get();
+ place_at_offset(dev_impl, cluster_size_offset, write_record.alignment / 2);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster size should be divisible by alignment");
+ });
+
+ // cluster_size not power of 2
+ write_record.write_to_disk(dev).get();
+ place_at_offset(dev_impl, cluster_size_offset, write_record.alignment * 3);
+ repair_crc32(dev_impl);
+ BOOST_CHECK_EXCEPTION(bootstrap_record::read_from_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster size should be a power of 2");
+ });
+}
+
+
+
+SEASTAR_THREAD_TEST_CASE(invalid_shards_info_write) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ bootstrap_record write_record = default_write_record;
+
+ // shards_nb > max_shards_nb
+ write_record = default_write_record;
+ write_record.shards_info = prepare_valid_shards_info(bootstrap_record::max_shards_nb + 1);
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, fmt::format("Shards number should be smaller or equal to {}", bootstrap_record::max_shards_nb));
+ });
+
+ // shards_nb == 0
+ write_record = default_write_record;
+ write_record.shards_info.clear();
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Shards number should be greater than 0");
+ });
+
+ // metadata_cluster not in available_clusters range
+ write_record = default_write_record;
+ write_record.shards_info = {{1, {2, 3}}};
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster with metadata should be inside available cluster range");
+ });
+
+ write_record = default_write_record;
+ write_record.shards_info = {{3, {2, 3}}};
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster with metadata should be inside available cluster range");
+ });
+
+ // available_clusters.beg > available_clusters.end
+ write_record = default_write_record;
+ write_record.shards_info = {{3, {4, 2}}};
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Invalid cluster range");
+ });
+
+ // available_clusters.beg == available_clusters.end
+ write_record = default_write_record;
+ write_record.shards_info = {{2, {2, 2}}};
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Invalid cluster range");
+ });
+
+ // available_clusters contains cluster 0
+ write_record = default_write_record;
+ write_record.shards_info = {{1, {0, 5}}};
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Range of available clusters should not contain cluster 0");
+ });
+
+ // available_clusters overlap
+ write_record = default_write_record;
+ write_record.shards_info = {{1, {1, 3}}, {2, {2, 4}}};
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster ranges should not overlap");
+ });
+}
+
+SEASTAR_THREAD_TEST_CASE(invalid_alignment_write) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ bootstrap_record write_record = default_write_record;
+
+ // alignment not power of 2
+ write_record = default_write_record;
+ write_record.alignment = bootstrap_record::min_alignment + 1;
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Alignment should be a power of 2");
+ });
+
+ // alignment smaller than bootstrap_record::min_alignment
+ write_record = default_write_record;
+ write_record.alignment = bootstrap_record::min_alignment / 2;
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, fmt::format("Alignment should be greater or equal to {}", bootstrap_record::min_alignment));
+ });
+}
+
+SEASTAR_THREAD_TEST_CASE(invalid_cluster_size_write) {
+ auto dev_impl = make_shared<mock_block_device_impl>();
+ block_device dev(dev_impl);
+ bootstrap_record write_record = default_write_record;
+
+ // cluster_size not divisible by alignment
+ write_record = default_write_record;
+ write_record.cluster_size = write_record.alignment / 2;
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster size should be divisible by alignment");
+ });
+
+ // cluster_size not power of 2
+ write_record = default_write_record;
+ write_record.cluster_size = write_record.alignment * 3;
+ BOOST_CHECK_EXCEPTION(write_record.write_to_disk(dev).get(), invalid_bootstrap_record,
+ [] (const invalid_bootstrap_record& ex) {
+ return check_exception_message(ex, "Cluster size should be a power of 2");
+ });
+}
diff --git a/tests/unit/fs_mock_block_device.cc b/tests/unit/fs_mock_block_device.cc
new file mode 100644
index 00000000..6f83587e
--- /dev/null
+++ b/tests/unit/fs_mock_block_device.cc
@@ -0,0 +1,50 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB Ltd.
+ */
+
+#include "fs_mock_block_device.hh"
+
+namespace seastar::fs {
+
+namespace {
+logger mlogger("fs_mock_block_device");
+} // namespace
+
+future<size_t> mock_block_device_impl::write(uint64_t pos, const void* buffer, size_t len, const io_priority_class&) {
+ mlogger.debug("write({}, ..., {})", pos, len);
+ writes.emplace_back(write_operation {
+ pos,
+ temporary_buffer<uint8_t>(static_cast<const uint8_t*>(buffer), len)
+ });
+ if (buf.size() < pos + len)
+ buf.resize(pos + len);
+ std::memcpy(buf.data() + pos, buffer, len);
+ return make_ready_future<size_t>(len);
+}
+
+future<size_t> mock_block_device_impl::read(uint64_t pos, void* buffer, size_t len, const io_priority_class&) noexcept {
+ mlogger.debug("read({}, ..., {})", pos, len);
+ if (buf.size() < pos + len)
+ buf.resize(pos + len);
+ std::memcpy(buffer, buf.c_str() + pos, len);
+ return make_ready_future<size_t>(len);
+}
+
+} // seastar::fs
diff --git a/tests/unit/CMakeLists.txt b/tests/unit/CMakeLists.txt
index b2669e0a..f9591046 100644
--- a/tests/unit/CMakeLists.txt
+++ b/tests/unit/CMakeLists.txt
@@ -365,6 +365,10 @@ if (Seastar_EXPERIMENTAL_FS)
seastar_add_app_test (fs_block_device
SOURCES fs_block_device_test.cc
LIBRARIES seastar_testing)
+ seastar_add_test (fs_bootstrap_record
+ SOURCES
+ fs_bootstrap_record_test.cc
+ fs_mock_block_device.cc)
seastar_add_test (fs_cluster_allocator
KIND BOOST
SOURCES fs_cluster_allocator_test.cc)
--
2.26.1

Krzysztof Małysa
Apr 20, 2020, 8:02:27 AM
to seast...@googlegroups.com, Michał Niciejewski, sa...@scylladb.com, ank...@gmail.com, wmi...@protonmail.com
From: Michał Niciejewski <qup...@gmail.com>

The bootstrap record serves the same role as the superblock in other
filesystems. It contains the basic information essential to properly
bootstrap the filesystem:
- filesystem version
- alignment used for data writes
- cluster size
- inode number of the root directory
- information needed to bootstrap every shard of the filesystem:
* id of the first metadata log cluster
* range of available clusters for data and metadata

Signed-off-by: Michał Niciejewski <qup...@gmail.com>
---
src/fs/bootstrap_record.hh | 98 ++++++++++++++++++
src/fs/crc.hh | 34 ++++++
src/fs/bootstrap_record.cc | 206 +++++++++++++++++++++++++++++++++++++
CMakeLists.txt | 3 +
4 files changed, 341 insertions(+)
create mode 100644 src/fs/bootstrap_record.hh
create mode 100644 src/fs/crc.hh
create mode 100644 src/fs/bootstrap_record.cc

diff --git a/src/fs/bootstrap_record.hh b/src/fs/bootstrap_record.hh
new file mode 100644
index 00000000..ee15295a
--- /dev/null
+++ b/src/fs/bootstrap_record.hh
@@ -0,0 +1,98 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/cluster.hh"
+#include "fs/inode.hh"
+#include "seastar/fs/block_device.hh"
+
+#include <exception>
+
+namespace seastar::fs {
+
+class invalid_bootstrap_record : public std::runtime_error {
+public:
+ explicit invalid_bootstrap_record(const std::string& msg) : std::runtime_error(msg) {}
+ explicit invalid_bootstrap_record(const char* msg) : std::runtime_error(msg) {}
+};
+
+/// In-memory version of the record describing characteristics of the file system (~superblock).
+class bootstrap_record {
+public:
+ static constexpr uint64_t magic_number = 0x5343594c4c414653; // SCYLLAFS
+ static constexpr uint32_t max_shards_nb = 500;
+ static constexpr unit_size_t min_alignment = 4096;
+
+ struct shard_info {
+ cluster_id_t metadata_cluster; /// cluster id of the first metadata log cluster
+ cluster_range available_clusters; /// range of clusters for data for this shard
+ };
+
+ uint64_t version; /// file system version
+ unit_size_t alignment; /// write alignment in bytes
+ unit_size_t cluster_size; /// cluster size in bytes
+ inode_t root_directory; /// root dir inode number
+ std::vector<shard_info> shards_info; /// basic information about each file system shard
+
+ bootstrap_record() = default;
+ bootstrap_record(uint64_t version, unit_size_t alignment, unit_size_t cluster_size, inode_t root_directory,
+ std::vector<shard_info> shards_info)
+ : version(version), alignment(alignment), cluster_size(cluster_size) , root_directory(root_directory)
+ , shards_info(std::move(shards_info)) {}
+
+ /// number of file system shards
+ uint32_t shards_nb() const noexcept {
+ return shards_info.size();
+ }
+
+ static future<bootstrap_record> read_from_disk(block_device& device);
+ future<> write_to_disk(block_device& device) const;
+
+ friend bool operator==(const bootstrap_record&, const bootstrap_record&) noexcept;
+ friend bool operator!=(const bootstrap_record&, const bootstrap_record&) noexcept;
+};
+
+inline bool operator==(const bootstrap_record::shard_info& lhs, const bootstrap_record::shard_info& rhs) noexcept {
+ return lhs.metadata_cluster == rhs.metadata_cluster and lhs.available_clusters == rhs.available_clusters;
+}
+
+inline bool operator!=(const bootstrap_record::shard_info& lhs, const bootstrap_record::shard_info& rhs) noexcept {
+ return !(lhs == rhs);
+}
+
+inline bool operator!=(const bootstrap_record& lhs, const bootstrap_record& rhs) noexcept {
+ return !(lhs == rhs);
+}
+
+/// On-disk version of the record describing characteristics of the file system (~superblock).
+struct bootstrap_record_disk {
+ uint64_t magic;
+ uint64_t version;
+ unit_size_t alignment;
+ unit_size_t cluster_size;
+ inode_t root_directory;
+ uint32_t shards_nb;
+ bootstrap_record::shard_info shards_info[bootstrap_record::max_shards_nb];
+ uint32_t crc;
+};
+
+}
diff --git a/src/fs/crc.hh b/src/fs/crc.hh
new file mode 100644
index 00000000..da557323
--- /dev/null
+++ b/src/fs/crc.hh
@@ -0,0 +1,34 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include <boost/crc.hpp>
+
+namespace seastar::fs {
+
+inline uint32_t crc32(const void* buff, size_t len) noexcept {
+ boost::crc_32_type result;
+ result.process_bytes(buff, len);
+ return result.checksum();
+}
+
+}
diff --git a/src/fs/bootstrap_record.cc b/src/fs/bootstrap_record.cc
new file mode 100644
index 00000000..a342efb6
--- /dev/null
+++ b/src/fs/bootstrap_record.cc
@@ -0,0 +1,206 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#include "fs/bootstrap_record.hh"
+#include "fs/crc.hh"
+#include "seastar/core/print.hh"
+#include "seastar/core/units.hh"
+
+namespace seastar::fs {
+
+namespace {
+
+constexpr unit_size_t write_alignment = 4 * KB;
+constexpr disk_offset_t bootstrap_record_offset = 0;
+
+constexpr size_t aligned_bootstrap_record_size =
+ (1 + (sizeof(bootstrap_record_disk) - 1) / write_alignment) * write_alignment;
+constexpr size_t crc_offset = offsetof(bootstrap_record_disk, crc);
+
+inline std::optional<invalid_bootstrap_record> check_alignment(unit_size_t alignment) {
+ if (!is_power_of_2(alignment)) {
+ return invalid_bootstrap_record(fmt::format("Alignment should be a power of 2, read alignment '{}'",
+ alignment));
+ }
+ if (alignment < bootstrap_record::min_alignment) {
+ return invalid_bootstrap_record(fmt::format("Alignment should be greater or equal to {}, read alignment '{}'",
+ bootstrap_record::min_alignment, alignment));
+ }
+ return std::nullopt;
+}
+
+inline std::optional<invalid_bootstrap_record> check_cluster_size(unit_size_t cluster_size, unit_size_t alignment) {
+ if (!is_power_of_2(cluster_size)) {
+ return invalid_bootstrap_record(fmt::format("Cluster size should be a power of 2, read cluster size '{}'", cluster_size));
+ }
+ if (cluster_size % alignment != 0) {
+ return invalid_bootstrap_record(fmt::format(
+ "Cluster size should be divisible by alignment, read alignment '{}', read cluster size '{}'",
+ alignment, cluster_size));
+ }
+ return std::nullopt;
+}
+
+inline std::optional<invalid_bootstrap_record> check_shards_number(uint32_t shards_nb) {
+ if (shards_nb == 0) {
+ return invalid_bootstrap_record(fmt::format("Shards number should be greater than 0, read shards number '{}'",
+ shards_nb));
+ }
+ if (shards_nb > bootstrap_record::max_shards_nb) {
+ return invalid_bootstrap_record(fmt::format(
+ "Shards number should be smaller or equal to {}, read shards number '{}'",
+ bootstrap_record::max_shards_nb, shards_nb));
+ }
+ return std::nullopt;
+}
+
+std::optional<invalid_bootstrap_record> check_shards_info(std::vector<bootstrap_record::shard_info> shards_info) {
+ // check 1 <= beg <= metadata_cluster < end
+ for (const bootstrap_record::shard_info& info : shards_info) {
+ if (info.available_clusters.beg >= info.available_clusters.end) {
+ return invalid_bootstrap_record(fmt::format("Invalid cluster range, read cluster range [{}, {})",
+ info.available_clusters.beg, info.available_clusters.end));
+ }
+ if (info.available_clusters.beg == 0) {
+ return invalid_bootstrap_record(fmt::format(
+ "Range of available clusters should not contain cluster 0, read cluster range [{}, {})",
+ info.available_clusters.beg, info.available_clusters.end));
+ }
+ if (info.available_clusters.beg > info.metadata_cluster ||
+ info.available_clusters.end <= info.metadata_cluster) {
+ return invalid_bootstrap_record(fmt::format(
+ "Cluster with metadata should be inside available cluster range, read cluster range [{}, {}), read metadata cluster '{}'",
+ info.available_clusters.beg, info.available_clusters.end, info.metadata_cluster));
+ }
+ }
+
+ // check that ranges don't overlap
+ sort(shards_info.begin(), shards_info.end(),
+ [] (const bootstrap_record::shard_info& left,
+ const bootstrap_record::shard_info& right) {
+ return left.available_clusters.beg < right.available_clusters.beg;
+ });
+ for (size_t i = 1; i < shards_info.size(); i++) {
+ if (shards_info[i - 1].available_clusters.end > shards_info[i].available_clusters.beg) {
+ return invalid_bootstrap_record(fmt::format(
+                "Cluster ranges should not overlap, overlapping ranges [{}, {}), [{}, {})",
+ shards_info[i - 1].available_clusters.beg, shards_info[i - 1].available_clusters.end,
+ shards_info[i].available_clusters.beg, shards_info[i].available_clusters.end));
+ }
+ }
+ return std::nullopt;
+}
+
+}
+
+future<bootstrap_record> bootstrap_record::read_from_disk(block_device& device) {
+ auto bootstrap_record_buff = temporary_buffer<char>::aligned(write_alignment, aligned_bootstrap_record_size);
+ return device.read(bootstrap_record_offset, bootstrap_record_buff.get_write(), aligned_bootstrap_record_size)
+ .then([bootstrap_record_buff = std::move(bootstrap_record_buff)] (size_t ret) {
+ if (ret != aligned_bootstrap_record_size) {
+ return make_exception_future<bootstrap_record>(
+ invalid_bootstrap_record(fmt::format(
+ "Error while reading bootstrap record block, {} bytes read instead of {}",
+ ret, aligned_bootstrap_record_size)));
+ }
+
+ bootstrap_record_disk bootstrap_record_disk;
+ std::memcpy(&bootstrap_record_disk, bootstrap_record_buff.get(), sizeof(bootstrap_record_disk));
+
+ const uint32_t crc_calc = crc32(bootstrap_record_buff.get(), crc_offset);
+ if (crc_calc != bootstrap_record_disk.crc) {
+ return make_exception_future<bootstrap_record>(
+ invalid_bootstrap_record(fmt::format("Invalid CRC, expected crc '{}', read crc '{}'",
+ crc_calc, bootstrap_record_disk.crc)));
+ }
+ if (magic_number != bootstrap_record_disk.magic) {
+ return make_exception_future<bootstrap_record>(
+ invalid_bootstrap_record(fmt::format("Invalid magic number, expected magic '{}', read magic '{}'",
+ magic_number, bootstrap_record_disk.magic)));
+ }
+ if (std::optional<invalid_bootstrap_record> ret_check;
+ (ret_check = check_alignment(bootstrap_record_disk.alignment)) ||
+ (ret_check = check_cluster_size(bootstrap_record_disk.cluster_size, bootstrap_record_disk.alignment)) ||
+ (ret_check = check_shards_number(bootstrap_record_disk.shards_nb))) {
+ return make_exception_future<bootstrap_record>(ret_check.value());
+ }
+
+ const std::vector<shard_info> tmp_shards_info(bootstrap_record_disk.shards_info,
+ bootstrap_record_disk.shards_info + bootstrap_record_disk.shards_nb);
+
+ if (std::optional<invalid_bootstrap_record> ret_check;
+ (ret_check = check_shards_info(tmp_shards_info))) {
+ return make_exception_future<bootstrap_record>(ret_check.value());
+ }
+
+ bootstrap_record bootstrap_record_mem(bootstrap_record_disk.version,
+ bootstrap_record_disk.alignment,
+ bootstrap_record_disk.cluster_size,
+ bootstrap_record_disk.root_directory,
+ std::move(tmp_shards_info));
+
+ return make_ready_future<bootstrap_record>(std::move(bootstrap_record_mem));
+ });
+}
+
+future<> bootstrap_record::write_to_disk(block_device& device) const {
+ // initial checks
+ if (std::optional<invalid_bootstrap_record> ret_check;
+ (ret_check = check_alignment(alignment)) ||
+ (ret_check = check_cluster_size(cluster_size, alignment)) ||
+ (ret_check = check_shards_number(shards_nb())) ||
+ (ret_check = check_shards_info(shards_info))) {
+ return make_exception_future<>(ret_check.value());
+ }
+
+ auto bootstrap_record_buff = temporary_buffer<char>::aligned(write_alignment, aligned_bootstrap_record_size);
+ std::memset(bootstrap_record_buff.get_write(), 0, aligned_bootstrap_record_size);
+ bootstrap_record_disk* bootstrap_record_disk = (struct bootstrap_record_disk*)bootstrap_record_buff.get_write();
+
+ // prepare bootstrap_record_disk records
+ bootstrap_record_disk->magic = bootstrap_record::magic_number;
+ bootstrap_record_disk->version = version;
+ bootstrap_record_disk->alignment = alignment;
+ bootstrap_record_disk->cluster_size = cluster_size;
+ bootstrap_record_disk->root_directory = root_directory;
+ bootstrap_record_disk->shards_nb = shards_nb();
+ std::copy(shards_info.begin(), shards_info.end(), bootstrap_record_disk->shards_info);
+ bootstrap_record_disk->crc = crc32(bootstrap_record_disk, crc_offset);
+
+ return device.write(bootstrap_record_offset, bootstrap_record_buff.get(), aligned_bootstrap_record_size)
+ .then([bootstrap_record_buff = std::move(bootstrap_record_buff)] (size_t ret) {
+ if (ret != aligned_bootstrap_record_size) {
+ return make_exception_future<>(
+ invalid_bootstrap_record(fmt::format(
+ "Error while writing bootstrap record block to disk, {} bytes written instead of {}",
+ ret, aligned_bootstrap_record_size)));
+ }
+ return make_ready_future<>();
+ });
+}
+
+bool operator==(const bootstrap_record& lhs, const bootstrap_record& rhs) noexcept {
+ return lhs.version == rhs.version and lhs.alignment == rhs.alignment and
+ lhs.cluster_size == rhs.cluster_size and lhs.root_directory == rhs.root_directory and
+ lhs.shards_info == rhs.shards_info;
+}
+
+}
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 891201a3..ca994d42 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -661,9 +661,12 @@ if (Seastar_EXPERIMENTAL_FS)
include/seastar/fs/file.hh
include/seastar/fs/temporary_file.hh
src/fs/bitwise.hh
+ src/fs/bootstrap_record.cc
+ src/fs/bootstrap_record.hh
src/fs/cluster.hh
src/fs/cluster_allocator.cc
src/fs/cluster_allocator.hh
+ src/fs/crc.hh

Krzysztof Małysa
Apr 20, 2020, 8:02:28 AM
to seast...@googlegroups.com, Krzysztof Małysa, sa...@scylladb.com, ank...@gmail.com, qup...@gmail.com, wmi...@protonmail.com
overloaded is a useful wrapper that simplifies the usage of std::visit
over std::variant. It allows matching variant alternatives by type using
lambdas, similar to pattern matching in functional languages.
For details see: https://en.cppreference.com/w/cpp/utility/variant/visit#Example

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
include/seastar/fs/overloaded.hh | 26 ++++++++++++++++++++++++++
CMakeLists.txt | 1 +
2 files changed, 27 insertions(+)
create mode 100644 include/seastar/fs/overloaded.hh

diff --git a/include/seastar/fs/overloaded.hh b/include/seastar/fs/overloaded.hh
new file mode 100644
index 00000000..2a205ba3
--- /dev/null
+++ b/include/seastar/fs/overloaded.hh
@@ -0,0 +1,26 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+// Taken from: https://en.cppreference.com/w/cpp/utility/variant/visit
+template<class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
+template<class... Ts> overloaded(Ts...) -> overloaded<Ts...>;
diff --git a/CMakeLists.txt b/CMakeLists.txt
index ca994d42..be3f921f 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -659,6 +659,7 @@ if (Seastar_EXPERIMENTAL_FS)
# SeastarFS source files
include/seastar/fs/block_device.hh
include/seastar/fs/file.hh
+ include/seastar/fs/overloaded.hh
include/seastar/fs/temporary_file.hh
src/fs/bitwise.hh
src/fs/bootstrap_record.cc
--
2.26.1

Krzysztof Małysa
Apr 20, 2020, 8:02:29 AM
to seast...@googlegroups.com, Krzysztof Małysa, sa...@scylladb.com, ank...@gmail.com, qup...@gmail.com, wmi...@protonmail.com
path.hh provides the extract_last_component() function, which extracts
the last component of the provided path

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
src/fs/path.hh | 42 ++++++++++++++++++
tests/unit/fs_path_test.cc | 90 ++++++++++++++++++++++++++++++++++++++
CMakeLists.txt | 1 +
tests/unit/CMakeLists.txt | 3 ++
4 files changed, 136 insertions(+)
create mode 100644 src/fs/path.hh
create mode 100644 tests/unit/fs_path_test.cc

diff --git a/src/fs/path.hh b/src/fs/path.hh
new file mode 100644
index 00000000..9da4c517
--- /dev/null
+++ b/src/fs/path.hh
@@ -0,0 +1,42 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include <string>
+
+namespace seastar::fs {
+
+// Extracts the last component in @p path. WARNING: The last component is empty iff @p path is empty or ends with '/'
+inline std::string extract_last_component(std::string& path) {
+ auto beg = path.find_last_of('/');
+ if (beg == path.npos) {
+ std::string res = std::move(path);
+ path = {};
+ return res;
+ }
+
+ auto res = path.substr(beg + 1);
+ path.resize(beg + 1);
+ return res;
+}
+
+} // namespace seastar::fs
diff --git a/tests/unit/fs_path_test.cc b/tests/unit/fs_path_test.cc
new file mode 100644
index 00000000..956e64d7
--- /dev/null
+++ b/tests/unit/fs_path_test.cc
@@ -0,0 +1,90 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#include "fs/path.hh"
+
+#define BOOST_TEST_MODULE fs
+#include <boost/test/included/unit_test.hpp>
+
+using namespace seastar::fs;
+
+BOOST_AUTO_TEST_CASE(last_component_simple) {
+ {
+ std::string str = "";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), "");
+ BOOST_REQUIRE_EQUAL(str, "");
+ }
+ {
+ std::string str = "/";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), "");
+ BOOST_REQUIRE_EQUAL(str, "/");
+ }
+ {
+ std::string str = "/foo/bar.txt";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), "bar.txt");
+ BOOST_REQUIRE_EQUAL(str, "/foo/");
+ }
+ {
+ std::string str = "/foo/.bar";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), ".bar");
+ BOOST_REQUIRE_EQUAL(str, "/foo/");
+ }
+ {
+ std::string str = "/foo/bar/";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), "");
+ BOOST_REQUIRE_EQUAL(str, "/foo/bar/");
+ }
+ {
+ std::string str = "/foo/.";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), ".");
+ BOOST_REQUIRE_EQUAL(str, "/foo/");
+ }
+ {
+ std::string str = "/foo/..";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), "..");
+ BOOST_REQUIRE_EQUAL(str, "/foo/");
+ }
+ {
+ std::string str = "bar.txt";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), "bar.txt");
+ BOOST_REQUIRE_EQUAL(str, "");
+ }
+ {
+ std::string str = ".bar";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), ".bar");
+ BOOST_REQUIRE_EQUAL(str, "");
+ }
+ {
+ std::string str = ".";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), ".");
+ BOOST_REQUIRE_EQUAL(str, "");
+ }
+ {
+ std::string str = "..";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), "..");
+ BOOST_REQUIRE_EQUAL(str, "");
+ }
+ {
+ std::string str = "//host";
+ BOOST_REQUIRE_EQUAL(extract_last_component(str), "host");
+ BOOST_REQUIRE_EQUAL(str, "//");
+ }
+}
diff --git a/CMakeLists.txt b/CMakeLists.txt
index be3f921f..fb8fe32c 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -670,6 +670,7 @@ if (Seastar_EXPERIMENTAL_FS)
src/fs/crc.hh
src/fs/file.cc
src/fs/inode.hh
+ src/fs/path.hh
src/fs/range.hh
src/fs/units.hh
)
diff --git a/tests/unit/CMakeLists.txt b/tests/unit/CMakeLists.txt
index f9591046..07551b0b 100644
--- a/tests/unit/CMakeLists.txt
+++ b/tests/unit/CMakeLists.txt
@@ -372,6 +372,9 @@ if (Seastar_EXPERIMENTAL_FS)
seastar_add_test (fs_cluster_allocator
KIND BOOST
SOURCES fs_cluster_allocator_test.cc)
+ seastar_add_test (fs_path
+ KIND BOOST
+ SOURCES fs_path_test.cc)

Krzysztof Małysa
Apr 20, 2020, 8:02:30 AM
to seast...@googlegroups.com, Krzysztof Małysa, sa...@scylladb.com, ank...@gmail.com, qup...@gmail.com, wmi...@protonmail.com
value shared lock allows locking (using shared_mutex) a specified value.
One operation locks only one value, but value shared lock lets you
maintain locks on different values in one place. Locking is also
"on demand", i.e. the corresponding shared_mutex is not created until a
lock is taken on a value, and it is deleted as soon as the value is no
longer locked by anyone. It serves as a dynamic pool of shared_mutexes
acquired on demand.

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
src/fs/value_shared_lock.hh | 65 +++++++++++++++++++++++++++++++++++++
CMakeLists.txt | 1 +
2 files changed, 66 insertions(+)
create mode 100644 src/fs/value_shared_lock.hh

diff --git a/src/fs/value_shared_lock.hh b/src/fs/value_shared_lock.hh
new file mode 100644
index 00000000..6c7a3adf
--- /dev/null
+++ b/src/fs/value_shared_lock.hh
@@ -0,0 +1,65 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+
+#include "seastar/core/shared_mutex.hh"
+
+#include <map>
+
+namespace seastar::fs {
+
+template<class Value>
+class value_shared_lock {
+ struct lock_info {
+ size_t users_num = 0;
+ shared_mutex lock;
+ };
+
+ std::map<Value, lock_info> _locks;
+
+public:
+ value_shared_lock() = default;
+
+ template<class Func>
+ auto with_shared_on(const Value& val, Func&& func) {
+ auto it = _locks.emplace(val, lock_info {}).first;
+ ++it->second.users_num;
+ return with_shared(it->second.lock, std::forward<Func>(func)).finally([this, it] {
+ if (--it->second.users_num == 0) {
+ _locks.erase(it);
+ }
+ });
+ }
+
+ template<class Func>
+ auto with_lock_on(const Value& val, Func&& func) {
+ auto it = _locks.emplace(val, lock_info {}).first;
+ ++it->second.users_num;
+ return with_lock(it->second.lock, std::forward<Func>(func)).finally([this, it] {
+ if (--it->second.users_num == 0) {
+ _locks.erase(it);
+ }
+ });
+ }
+};
+
+} // namespace seastar::fs
diff --git a/CMakeLists.txt b/CMakeLists.txt
index fb8fe32c..8a59eca6 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -673,6 +673,7 @@ if (Seastar_EXPERIMENTAL_FS)
src/fs/path.hh
src/fs/range.hh
src/fs/units.hh
+ src/fs/value_shared_lock.hh
)
endif()

--
2.26.1

Krzysztof Małysa
Apr 20, 2020, 8:02:32 AM
to seast...@googlegroups.com, Krzysztof Małysa, sa...@scylladb.com, ank...@gmail.com, qup...@gmail.com, wmi...@protonmail.com
Creating an unlinked file may be useful as a temporary file or to expose
the file via a path only after the file is filled with contents.

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
src/fs/metadata_disk_entries.hh | 51 +++++++++++-
src/fs/metadata_log.hh | 6 ++
src/fs/metadata_log_bootstrap.hh | 2 +
.../create_and_open_unlinked_file.hh | 77 +++++++++++++++++++
src/fs/metadata_to_disk_buffer.hh | 5 ++
src/fs/metadata_log.cc | 21 +++++
src/fs/metadata_log_bootstrap.cc | 13 ++++
CMakeLists.txt | 1 +
8 files changed, 175 insertions(+), 1 deletion(-)
create mode 100644 src/fs/metadata_log_operations/create_and_open_unlinked_file.hh

diff --git a/src/fs/metadata_disk_entries.hh b/src/fs/metadata_disk_entries.hh
index 44c2a1c7..437c2c2b 100644
--- a/src/fs/metadata_disk_entries.hh
+++ b/src/fs/metadata_disk_entries.hh
@@ -27,10 +27,52 @@

namespace seastar::fs {

+struct ondisk_unix_metadata {
+ uint32_t perms;
+ uint32_t uid;
+ uint32_t gid;
+ uint64_t btime_ns;
+ uint64_t mtime_ns;
+ uint64_t ctime_ns;
+} __attribute__((packed));
+
+static_assert(sizeof(decltype(ondisk_unix_metadata::perms)) >= sizeof(decltype(unix_metadata::perms)));
+static_assert(sizeof(decltype(ondisk_unix_metadata::uid)) >= sizeof(decltype(unix_metadata::uid)));
+static_assert(sizeof(decltype(ondisk_unix_metadata::gid)) >= sizeof(decltype(unix_metadata::gid)));
+static_assert(sizeof(decltype(ondisk_unix_metadata::btime_ns)) >= sizeof(decltype(unix_metadata::btime_ns)));
+static_assert(sizeof(decltype(ondisk_unix_metadata::mtime_ns)) >= sizeof(decltype(unix_metadata::mtime_ns)));
+static_assert(sizeof(decltype(ondisk_unix_metadata::ctime_ns)) >= sizeof(decltype(unix_metadata::ctime_ns)));
+
+inline unix_metadata ondisk_metadata_to_metadata(const ondisk_unix_metadata& ondisk_metadata) noexcept {
+ unix_metadata res;
+ static_assert(sizeof(ondisk_metadata) == 36,
+ "metadata size changed: check if above static asserts and below assignments need update");
+ res.perms = static_cast<file_permissions>(ondisk_metadata.perms);
+ res.uid = ondisk_metadata.uid;
+ res.gid = ondisk_metadata.gid;
+ res.btime_ns = ondisk_metadata.btime_ns;
+ res.mtime_ns = ondisk_metadata.mtime_ns;
+ res.ctime_ns = ondisk_metadata.ctime_ns;
+ return res;
+}
+
+inline ondisk_unix_metadata metadata_to_ondisk_metadata(const unix_metadata& metadata) noexcept {
+ ondisk_unix_metadata res;
+ static_assert(sizeof(res) == 36, "metadata size changed: check if below assignments need update");
+ res.perms = static_cast<decltype(res.perms)>(metadata.perms);
+ res.uid = metadata.uid;
+ res.gid = metadata.gid;
+ res.btime_ns = metadata.btime_ns;
+ res.mtime_ns = metadata.mtime_ns;
+ res.ctime_ns = metadata.ctime_ns;
+ return res;
+}
+
enum ondisk_type : uint8_t {
INVALID = 0,
CHECKPOINT,
NEXT_METADATA_CLUSTER,
+ CREATE_INODE,
};

struct ondisk_checkpoint {
@@ -54,9 +96,16 @@ struct ondisk_next_metadata_cluster {
cluster_id_t cluster_id; // metadata log continues there
} __attribute__((packed));

+struct ondisk_create_inode {
+ inode_t inode;
+ uint8_t is_directory;
+ ondisk_unix_metadata metadata;
+} __attribute__((packed));
+
template<typename T>
constexpr size_t ondisk_entry_size(const T& entry) noexcept {
- static_assert(std::is_same_v<T, ondisk_next_metadata_cluster>, "ondisk entry size not defined for given type");
+ static_assert(std::is_same_v<T, ondisk_next_metadata_cluster> or
+ std::is_same_v<T, ondisk_create_inode>, "ondisk entry size not defined for given type");
return sizeof(ondisk_type) + sizeof(entry);
}

diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
index c10852a3..6f069c13 100644
--- a/src/fs/metadata_log.hh
+++ b/src/fs/metadata_log.hh
@@ -156,6 +156,8 @@ class metadata_log {

friend class metadata_log_bootstrap;

+ friend class create_and_open_unlinked_file_operation;
+
public:
metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment,
shared_ptr<metadata_to_disk_buffer> cluster_buff);
@@ -176,6 +178,8 @@ class metadata_log {
return _inodes.count(inode) != 0;
}

+ inode_info& memory_only_create_inode(inode_t inode, bool is_directory, unix_metadata metadata);
+
template<class Func>
void schedule_background_task(Func&& task) {
_background_futures = when_all_succeed(_background_futures.get_future(), std::forward<Func>(task));
@@ -286,6 +290,8 @@ class metadata_log {
// Returns size of the file or throws exception iff @p inode is invalid
file_offset_t file_size(inode_t inode) const;

+ future<inode_t> create_and_open_unlinked_file(file_permissions perms);
+
// All disk-related errors will be exposed here
future<> flush_log() {
return flush_curr_cluster();
diff --git a/src/fs/metadata_log_bootstrap.hh b/src/fs/metadata_log_bootstrap.hh
index 5da79631..4a1fa7e9 100644
--- a/src/fs/metadata_log_bootstrap.hh
+++ b/src/fs/metadata_log_bootstrap.hh
@@ -115,6 +115,8 @@ class metadata_log_bootstrap {

bool inode_exists(inode_t inode);

+ future<> bootstrap_create_inode();
+
public:
static future<> bootstrap(metadata_log& metadata_log, inode_t root_dir, cluster_id_t first_metadata_cluster_id,
cluster_range available_clusters, fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id);
diff --git a/src/fs/metadata_log_operations/create_and_open_unlinked_file.hh b/src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
new file mode 100644
index 00000000..79c5e9f2
--- /dev/null
+++ b/src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
@@ -0,0 +1,77 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/metadata_disk_entries.hh"
+#include "fs/metadata_log.hh"
+#include "seastar/core/future.hh"
+
+namespace seastar::fs {
+
+class create_and_open_unlinked_file_operation {
+ metadata_log& _metadata_log;
+
+ create_and_open_unlinked_file_operation(metadata_log& metadata_log) : _metadata_log(metadata_log) {}
+
+ future<inode_t> create_and_open_unlinked_file(file_permissions perms) {
+ using namespace std::chrono;
+ uint64_t curr_time_ns = duration_cast<nanoseconds>(system_clock::now().time_since_epoch()).count();
+ unix_metadata unx_mtdt = {
+ perms,
+ 0, // TODO: Eventually, we'll want a user to be able to pass his credentials when bootstrapping the
+ 0, // file system -- that will allow us to authorize users on startup (e.g. via LDAP or whatnot).
+ curr_time_ns,
+ curr_time_ns,
+ curr_time_ns
+ };
+
+ inode_t new_inode = _metadata_log._inode_allocator.alloc();
+ ondisk_create_inode ondisk_entry {
+ new_inode,
+ false,
+ metadata_to_ondisk_metadata(unx_mtdt)
+ };
+
+ switch (_metadata_log.append_ondisk_entry(ondisk_entry)) {
+ case metadata_log::append_result::TOO_BIG:
+ return make_exception_future<inode_t>(cluster_size_too_small_to_perform_operation_exception());
+ case metadata_log::append_result::NO_SPACE:
+ return make_exception_future<inode_t>(no_more_space_exception());
+ case metadata_log::append_result::APPENDED:
+ inode_info& new_inode_info = _metadata_log.memory_only_create_inode(new_inode, false, unx_mtdt);
+ // We don't have to lock, as there was no context switch since the allocation of the inode number
+ ++new_inode_info.opened_files_count;
+ return make_ready_future<inode_t>(new_inode);
+ }
+ __builtin_unreachable();
+ }
+
+public:
+ static future<inode_t> perform(metadata_log& metadata_log, file_permissions perms) {
+ return do_with(create_and_open_unlinked_file_operation(metadata_log),
+ [perms = std::move(perms)](auto& obj) {
+ return obj.create_and_open_unlinked_file(std::move(perms));
+ });
+ }
+};
+
+} // namespace seastar::fs
diff --git a/src/fs/metadata_to_disk_buffer.hh b/src/fs/metadata_to_disk_buffer.hh
index bd60f4f3..593ad46a 100644
--- a/src/fs/metadata_to_disk_buffer.hh
+++ b/src/fs/metadata_to_disk_buffer.hh
@@ -152,6 +152,11 @@ class metadata_to_disk_buffer : protected to_disk_buffer {
}

public:
+ [[nodiscard]] virtual append_result append(const ondisk_create_inode& create_inode) noexcept {
+ // TODO: maybe add a constexpr static field to each ondisk_* entry specifying what type it is?
+ return append_simple(CREATE_INODE, create_inode);
+ }
+
using to_disk_buffer::flush_to_disk;
};

diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc
index 6e29f2e5..be523fc7 100644
--- a/src/fs/metadata_log.cc
+++ b/src/fs/metadata_log.cc
@@ -26,6 +26,7 @@
#include "fs/metadata_disk_entries.hh"
#include "fs/metadata_log.hh"
#include "fs/metadata_log_bootstrap.hh"
+#include "fs/metadata_log_operations/create_and_open_unlinked_file.hh"
#include "fs/metadata_to_disk_buffer.hh"
#include "fs/path.hh"
#include "fs/units.hh"
@@ -80,6 +81,22 @@ future<> metadata_log::shutdown() {
});
}

+inode_info& metadata_log::memory_only_create_inode(inode_t inode, bool is_directory, unix_metadata metadata) {
+ assert(_inodes.count(inode) == 0);
+ return _inodes.emplace(inode, inode_info {
+ 0,
+ 0,
+ metadata,
+ [&]() -> decltype(inode_info::contents) {
+ if (is_directory) {
+ return inode_info::directory {};
+ }
+
+ return inode_info::file {};
+ }()
+ }).first->second;
+}
+
void metadata_log::schedule_flush_of_curr_cluster() {
// Make writes concurrent (TODO: maybe serialized within *one* cluster would be faster?)
schedule_background_task(do_with(_curr_cluster_buff, &_device, [](auto& crr_clstr_bf, auto& device) {
@@ -213,6 +230,10 @@ file_offset_t metadata_log::file_size(inode_t inode) const {
}, it->second.contents);
}

+future<inode_t> metadata_log::create_and_open_unlinked_file(file_permissions perms) {
+ return create_and_open_unlinked_file_operation::perform(*this, std::move(perms));
+}
+
// TODO: think about how to make filesystem recoverable from ENOSPACE situation: flush() (or something else) throws ENOSPACE,
// then it should be possible to compact some data (e.g. by truncating a file) via top-level interface and retrying the flush()
// without a ENOSPACE error. In particular if we delete all files after ENOSPACE it should be successful. It becomes especially
diff --git a/src/fs/metadata_log_bootstrap.cc b/src/fs/metadata_log_bootstrap.cc
index 926d79fe..702e0e34 100644
--- a/src/fs/metadata_log_bootstrap.cc
+++ b/src/fs/metadata_log_bootstrap.cc
@@ -211,6 +211,8 @@ future<> metadata_log_bootstrap::bootstrap_checkpointed_data() {
return invalid_entry_exception();
case NEXT_METADATA_CLUSTER:
return bootstrap_next_metadata_cluster();
+ case CREATE_INODE:
+ return bootstrap_create_inode();
}

// unknown type => metadata log corruption
@@ -242,6 +244,17 @@ bool metadata_log_bootstrap::inode_exists(inode_t inode) {
return _metadata_log._inodes.count(inode) != 0;
}

+future<> metadata_log_bootstrap::bootstrap_create_inode() {
+ ondisk_create_inode entry;
+ if (not _curr_checkpoint.read_entry(entry) or inode_exists(entry.inode)) {
+ return invalid_entry_exception();
+ }
+
+ _metadata_log.memory_only_create_inode(entry.inode, entry.is_directory,
+ ondisk_metadata_to_metadata(entry.metadata));
+ return now();
+}
+
future<> metadata_log_bootstrap::bootstrap(metadata_log& metadata_log, inode_t root_dir, cluster_id_t first_metadata_cluster_id,
cluster_range available_clusters, fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id) {
// Clear the metadata log
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 19666a8a..3304a02b 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -677,6 +677,7 @@ if (Seastar_EXPERIMENTAL_FS)
src/fs/metadata_log.hh
src/fs/metadata_log_bootstrap.cc
src/fs/metadata_log_bootstrap.hh
+ src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
src/fs/metadata_to_disk_buffer.hh
src/fs/path.hh
src/fs/range.hh
--
2.26.1

Krzysztof Małysa
Apr 20, 2020, 8:02:33 AM
to seast...@googlegroups.com, Krzysztof Małysa, sa...@scylladb.com, ank...@gmail.com, qup...@gmail.com, wmi...@protonmail.com
SeastarFS is a log-structured filesystem. Every shard will have 3
private logs:
- metadata log
- medium data log
- big data log (this is not actually a log, but in the big picture it
looks like one)

Disk space is divided into clusters (typically several MiB) that all
have equal size that is a multiple of the alignment (typically 4096
bytes). Each shard has its private pool of clusters (the assignment is
stored in the bootstrap record). Each log consumes clusters one by one --
it writes to the current one and, when the cluster becomes full, switches
to a new one obtained from the pool of free clusters managed by the
cluster_allocator. The metadata log and the medium data log write data in
the same manner: they fill up the cluster gradually from left to right.
The big data log takes a cluster and completely fills it with data at
once -- it is only used during big writes.

This commit adds the skeleton of the metadata log:
- data structures for holding metadata in memory with all operations on
this data structure i.e. manipulating files and their contents
- locking logic (detailed description can be found in metadata_log.hh)
- buffers for writing logs to disk (one for metadata and one for medium
data)
- basic higher level interface e.g. path lookup, iterating over
directory
- bootstrapping the metadata log == reading the metadata log from disk and
reconstructing the shard's filesystem structure from just before shutdown

File content is stored as a set of data vectors that may have one of
three kinds: in-memory data, on-disk data, hole. Small writes are
written directly to the metadata log, and because all metadata is stored
in memory these writes are also in memory, hence the in-memory kind.
Medium and large data are not stored in memory, so they are represented
using the on-disk kind. Enlarging a file via truncate may produce holes,
hence the hole kind.

Directory entries are stored as metadata log entries -- directory inodes
have no content.

To-disk buffers buffer data that will be written to disk. There are two
kinds: the (normal) to-disk buffer and the metadata to-disk buffer. The
latter is implemented using the former, but provides a higher-level
interface for appending metadata log entries rather than raw bytes.

The normal to-disk buffer appends data sequentially, but when a flush
occurs, the offset at which the next data will be appended is aligned up
to the alignment, to ensure that writes to the same cluster are
non-overlapping.
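
The align-up step amounts to the following (a minimal sketch assuming a
power-of-two alignment, which holds for the typical 4096 bytes; the
helper name is illustrative -- the patch uses helpers from
fs/bitwise.hh):

```cpp
#include <cstddef>

// Rounds pos up to the nearest multiple of alignment; alignment must be a
// power of two, so bit masking suffices.
constexpr size_t align_up(size_t pos, size_t alignment) {
    return (pos + alignment - 1) & ~(alignment - 1);
}
```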

The metadata to-disk buffer appends data using the normal to-disk
buffer, but does some formatting along the way. The structure of the
metadata log on disk is as follows:
| checkpoint_1 | entry_1, entry_2, ..., entry_n | checkpoint_2 | ... |
               |<----- checkpointed data ----->|
Every batch of metadata log entries is preceded by a checkpoint entry.
Appending to the metadata log appends to the current batch of entries.
Flushing or lack of space ends the current batch: the checkpoint entry
is updated (because it holds the CRC code of all the checkpointed data),
a write of the whole batch is requested, and a new checkpoint is started
(if there is space for it). The last checkpoint in a cluster contains a
special entry pointing to the next cluster utilized by the metadata log.
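
The input ordering of the checkpoint code can be illustrated as below.
Only the byte ordering is the point here: the code covers
| data | checkpointed_data_length |, with the length appended after the
data bytes (serialized little-endian for this illustration); a trivial
byte sum stands in for the real CRC-32:

```cpp
#include <cstdint>
#include <vector>

// Builds the byte sequence the checkpoint code is computed over (data
// followed by the 32-bit length) and returns a stand-in checksum of it.
uint32_t checkpoint_code_input_sum(const std::vector<uint8_t>& data) {
    std::vector<uint8_t> seq = data;
    uint32_t len = static_cast<uint32_t>(data.size());
    for (int i = 0; i < 4; ++i) {
        seq.push_back(static_cast<uint8_t>((len >> (8 * i)) & 0xff));
    }
    uint32_t sum = 0;
    for (uint8_t b : seq) {
        sum += b;
    }
    return sum;
}
```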

Bootstrapping is, in fact, just a replay of all the actions from the
metadata log that were saved on disk. It works as follows:
- metadata log clusters are read one by one
- for each cluster, checkpoints and the entries they cover are processed
  one after another until the last checkpoint, which contains a pointer
  to the next cluster
- processing works as follows:
  - the checkpoint entry is read; if it is invalid, the metadata log
    ends here (the last checkpoint was partially written, or the
    metadata log really ended here, or there was some data
    corruption...) and we stop
  - if it is correct, it contains the length of the checkpointed data
    (metadata log entries), so we process all of them (an error there
    indicates that data was corrupted while the CRC somehow remained
    correct, so we abort the whole bootstrap with an error)
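
The cluster walk above can be sketched as follows (a hypothetical
illustration: names are invented, entry processing and checkpoint
validation are elided):

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

struct cluster_sketch {
    // Pointer stored in the cluster's last checkpoint; empty for the
    // final cluster of the metadata log.
    std::optional<uint64_t> next;
};

// Follows next-cluster pointers starting from the first metadata cluster,
// returning the order in which clusters would be replayed.
std::vector<uint64_t> replay_order(const std::map<uint64_t, cluster_sketch>& clusters,
                                   uint64_t first) {
    std::vector<uint64_t> order;
    std::optional<uint64_t> curr = first;
    while (curr) {
        order.push_back(*curr);
        curr = clusters.at(*curr).next;
    }
    return order;
}
```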

Locking ensures that concurrent modifications of the metadata do not
corrupt it. E.g. creating a file is a complex operation: you have to
create an inode, add a directory entry that will represent this inode
under a path, and write the corresponding metadata log entries to disk.
Simultaneous attempts to create the same file could corrupt the file
system, not to mention a concurrent create and unlink on the same
path... Thus a careful and robust locking mechanism is used. For details
see metadata_log.hh.
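
The deadlock-avoidance idea used when two locks are needed (see
locks::with_locks in metadata_log.hh) is to always acquire the lock with
the smaller (inode, dir_entry) key first, so the lock-wait graph stays
acyclic. A minimal sketch with illustrative names:

```cpp
#include <string>
#include <utility>

// Key identifying a lock: an inode plus an optional directory-entry name
// (simplified here to a plain string).
using lock_key = std::pair<int, std::string>;

// Returns the fixed global order in which the two locks must be taken.
std::pair<lock_key, lock_key> locking_order(lock_key a, lock_key b) {
    if (a < b) {
        return {a, b};
    }
    return {b, a};
}
```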

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
include/seastar/fs/exceptions.hh | 88 +++++++++
src/fs/inode_info.hh | 221 ++++++++++++++++++++++
src/fs/metadata_disk_entries.hh | 63 +++++++
src/fs/metadata_log.hh | 295 ++++++++++++++++++++++++++++++
src/fs/metadata_log_bootstrap.hh | 123 +++++++++++++
src/fs/metadata_to_disk_buffer.hh | 158 ++++++++++++++++
src/fs/to_disk_buffer.hh | 138 ++++++++++++++
src/fs/unix_metadata.hh | 40 ++++
src/fs/metadata_log.cc | 222 ++++++++++++++++++++++
src/fs/metadata_log_bootstrap.cc | 264 ++++++++++++++++++++++++++
CMakeLists.txt | 10 +
11 files changed, 1622 insertions(+)
create mode 100644 include/seastar/fs/exceptions.hh
create mode 100644 src/fs/inode_info.hh
create mode 100644 src/fs/metadata_disk_entries.hh
create mode 100644 src/fs/metadata_log.hh
create mode 100644 src/fs/metadata_log_bootstrap.hh
create mode 100644 src/fs/metadata_to_disk_buffer.hh
create mode 100644 src/fs/to_disk_buffer.hh
create mode 100644 src/fs/unix_metadata.hh
create mode 100644 src/fs/metadata_log.cc
create mode 100644 src/fs/metadata_log_bootstrap.cc

diff --git a/include/seastar/fs/exceptions.hh b/include/seastar/fs/exceptions.hh
new file mode 100644
index 00000000..9941f557
--- /dev/null
+++ b/include/seastar/fs/exceptions.hh
@@ -0,0 +1,88 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+
+#include <exception>
+
+namespace seastar::fs {
+
+struct fs_exception : public std::exception {
+ const char* what() const noexcept override = 0;
+};
+
+struct cluster_size_too_small_to_perform_operation_exception : public std::exception {
+ const char* what() const noexcept override { return "Cluster size is too small to perform operation"; }
+};
+
+struct invalid_inode_exception : public fs_exception {
+ const char* what() const noexcept override { return "Invalid inode"; }
+};
+
+struct invalid_argument_exception : public fs_exception {
+ const char* what() const noexcept override { return "Invalid argument"; }
+};
+
+struct operation_became_invalid_exception : public fs_exception {
+ const char* what() const noexcept override { return "Operation became invalid"; }
+};
+
+struct no_more_space_exception : public fs_exception {
+ const char* what() const noexcept override { return "No more space on device"; }
+};
+
+struct file_already_exists_exception : public fs_exception {
+ const char* what() const noexcept override { return "File already exists"; }
+};
+
+struct filename_too_long_exception : public fs_exception {
+ const char* what() const noexcept override { return "Filename too long"; }
+};
+
+struct is_directory_exception : public fs_exception {
+ const char* what() const noexcept override { return "Is a directory"; }
+};
+
+struct directory_not_empty_exception : public fs_exception {
+ const char* what() const noexcept override { return "Directory is not empty"; }
+};
+
+struct path_lookup_exception : public fs_exception {
+ const char* what() const noexcept override = 0;
+};
+
+struct path_is_not_absolute_exception : public path_lookup_exception {
+ const char* what() const noexcept override { return "Path is not absolute"; }
+};
+
+struct invalid_path_exception : public path_lookup_exception {
+ const char* what() const noexcept override { return "Path is invalid"; }
+};
+
+struct no_such_file_or_directory_exception : public path_lookup_exception {
+ const char* what() const noexcept override { return "No such file or directory"; }
+};
+
+struct path_component_not_directory_exception : public path_lookup_exception {
+ const char* what() const noexcept override { return "A component used as a directory is not a directory"; }
+};
+
+} // namespace seastar::fs
diff --git a/src/fs/inode_info.hh b/src/fs/inode_info.hh
new file mode 100644
index 00000000..89bc71d8
--- /dev/null
+++ b/src/fs/inode_info.hh
@@ -0,0 +1,221 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/inode.hh"
+#include "fs/units.hh"
+#include "fs/unix_metadata.hh"
+#include "seastar/core/temporary_buffer.hh"
+#include "seastar/fs/overloaded.hh"
+
+#include <map>
+#include <variant>
+
+namespace seastar::fs {
+
+struct inode_data_vec {
+ file_range data_range; // data spans [beg, end) range of the file
+
+ struct in_mem_data {
+ temporary_buffer<uint8_t> data;
+ };
+
+ struct on_disk_data {
+ file_offset_t device_offset;
+ };
+
+ struct hole_data { };
+
+ std::variant<in_mem_data, on_disk_data, hole_data> data_location;
+
+ // TODO: rename that function to something more suitable
+ inode_data_vec share_copy() {
+ inode_data_vec shared;
+ shared.data_range = data_range;
+ std::visit(overloaded {
+ [&](inode_data_vec::in_mem_data& mem) {
+ shared.data_location = inode_data_vec::in_mem_data {mem.data.share()};
+ },
+ [&](inode_data_vec::on_disk_data& disk_data) {
+ shared.data_location = disk_data;
+ },
+ [&](inode_data_vec::hole_data&) {
+ shared.data_location = inode_data_vec::hole_data {};
+ },
+ }, data_location);
+ return shared;
+ }
+};
+
+struct inode_info {
+ uint32_t opened_files_count = 0; // Number of open files referencing inode
+ uint32_t directories_containing_file = 0;
+ unix_metadata metadata;
+
+ struct directory {
+ // TODO: directory entry cannot contain '/' character --> add checks for that
+ std::map<std::string, inode_t, std::less<>> entries; // entry name => inode
+ };
+
+ struct file {
+ std::map<file_offset_t, inode_data_vec> data; // file offset => data vector that begins there (data vectors
+ // do not overlap)
+
+ file_offset_t size() const noexcept {
+ return (data.empty() ? 0 : data.rbegin()->second.data_range.end);
+ }
+
+ // Deletes data vectors that are a subset of @p range and trims overlapping data vectors so that they
+ // no longer overlap. @p cut_data_vec_processor is called on each inode_data_vec (including parts of
+ // overlapping data vectors) that is deleted
+ template<class Func>
+ void cut_out_data_range(file_range range, Func&& cut_data_vec_processor) {
+ static_assert(std::is_invocable_v<Func, inode_data_vec>);
+ // Cut all vectors intersecting with range
+ auto it = data.lower_bound(range.beg);
+ if (it != data.begin() and are_intersecting(range, prev(it)->second.data_range)) {
+ --it;
+ }
+
+ while (it != data.end() and are_intersecting(range, it->second.data_range)) {
+ auto data_vec = std::move(data.extract(it++).mapped());
+ const auto cap = intersection(range, data_vec.data_range);
+ if (cap == data_vec.data_range) {
+ // Fully intersects => remove it
+ cut_data_vec_processor(std::move(data_vec));
+ continue;
+ }
+
+ // Overlaps => cut it, possibly into two parts:
+ // | data_vec |
+ // | cap |
+ // | left | mid | right |
+ // left and right remain, but mid is deleted
+ inode_data_vec left, mid, right;
+ left.data_range = {data_vec.data_range.beg, cap.beg};
+ mid.data_range = cap;
+ right.data_range = {cap.end, data_vec.data_range.end};
+ auto right_beg_shift = right.data_range.beg - data_vec.data_range.beg;
+ auto mid_beg_shift = mid.data_range.beg - data_vec.data_range.beg;
+ std::visit(overloaded {
+ [&](inode_data_vec::in_mem_data& mem) {
+ left.data_location = inode_data_vec::in_mem_data {mem.data.share(0, left.data_range.size())};
+ mid.data_location = inode_data_vec::in_mem_data {
+ mem.data.share(mid_beg_shift, mid.data_range.size())
+ };
+ right.data_location = inode_data_vec::in_mem_data {
+ mem.data.share(right_beg_shift, right.data_range.size())
+ };
+ },
+ [&](inode_data_vec::on_disk_data& disk_data) {
+ left.data_location = disk_data;
+ mid.data_location = inode_data_vec::on_disk_data {disk_data.device_offset + mid_beg_shift};
+ right.data_location = inode_data_vec::on_disk_data {disk_data.device_offset + right_beg_shift};
+ },
+ [&](inode_data_vec::hole_data&) {
+ left.data_location = inode_data_vec::hole_data {};
+ mid.data_location = inode_data_vec::hole_data {};
+ right.data_location = inode_data_vec::hole_data {};
+ },
+ }, data_vec.data_location);
+
+ // Save new data vectors
+ if (not left.data_range.is_empty()) {
+ data.emplace(left.data_range.beg, std::move(left));
+ }
+ if (not right.data_range.is_empty()) {
+ data.emplace(right.data_range.beg, std::move(right));
+ }
+
+ // Process deleted vector
+ cut_data_vec_processor(std::move(mid));
+ }
+ }
+
+ // Executes @p execute_on_data_range_processor on each data vector that is a subset of @p range.
+ // Data vectors on the edges are appropriately trimmed before being passed to the function.
+ template<class Func>
+ void execute_on_data_range(file_range range, Func&& execute_on_data_range_processor) {
+ static_assert(std::is_invocable_v<Func, inode_data_vec>);
+ auto it = data.lower_bound(range.beg);
+ if (it != data.begin() and are_intersecting(range, prev(it)->second.data_range)) {
+ --it;
+ }
+
+ while (it != data.end() and are_intersecting(range, it->second.data_range)) {
+ auto& data_vec = (it++)->second;
+ const auto cap = intersection(range, data_vec.data_range);
+ if (cap == data_vec.data_range) {
+ // Fully intersects => execute
+ execute_on_data_range_processor(data_vec.share_copy());
+ continue;
+ }
+
+ inode_data_vec mid;
+ mid.data_range = std::move(cap);
+ auto mid_beg_shift = mid.data_range.beg - data_vec.data_range.beg;
+ std::visit(overloaded {
+ [&](inode_data_vec::in_mem_data& mem) {
+ mid.data_location = inode_data_vec::in_mem_data {
+ mem.data.share(mid_beg_shift, mid.data_range.size())
+ };
+ },
+ [&](inode_data_vec::on_disk_data& disk_data) {
+ mid.data_location = inode_data_vec::on_disk_data {disk_data.device_offset + mid_beg_shift};
+ },
+ [&](inode_data_vec::hole_data&) {
+ mid.data_location = inode_data_vec::hole_data {};
+ },
+ }, data_vec.data_location);
+
+ // Execute on middle range
+ execute_on_data_range_processor(std::move(mid));
+ }
+ }
+ };
+
+ std::variant<directory, file> contents;
+
+ bool is_linked() const noexcept {
+ return directories_containing_file != 0;
+ }
+
+ bool is_open() const noexcept {
+ return opened_files_count != 0;
+ }
+
+ constexpr bool is_directory() const noexcept { return std::holds_alternative<directory>(contents); }
+
+ // These are noexcept because invalid access is a bug not an error
+ constexpr directory& get_directory() & noexcept { return std::get<directory>(contents); }
+ constexpr const directory& get_directory() const & noexcept { return std::get<directory>(contents); }
+ constexpr directory&& get_directory() && noexcept { return std::move(std::get<directory>(contents)); }
+
+ constexpr bool is_file() const noexcept { return std::holds_alternative<file>(contents); }
+
+ // These are noexcept because invalid access is a bug not an error
+ constexpr file& get_file() & noexcept { return std::get<file>(contents); }
+ constexpr const file& get_file() const & noexcept { return std::get<file>(contents); }
+ constexpr file&& get_file() && noexcept { return std::move(std::get<file>(contents)); }
+};
+
+} // namespace seastar::fs
diff --git a/src/fs/metadata_disk_entries.hh b/src/fs/metadata_disk_entries.hh
new file mode 100644
index 00000000..44c2a1c7
--- /dev/null
+++ b/src/fs/metadata_disk_entries.hh
@@ -0,0 +1,63 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/cluster.hh"
+#include "fs/inode.hh"
+#include "fs/unix_metadata.hh"
+
+namespace seastar::fs {
+
+enum ondisk_type : uint8_t {
+ INVALID = 0,
+ CHECKPOINT,
+ NEXT_METADATA_CLUSTER,
+};
+
+struct ondisk_checkpoint {
+ // The disk format is as follows:
+ // | ondisk_checkpoint | .............................. |
+ // | data |
+ // |<-- checkpointed_data_length -->|
+ // ^
+ // ______________________________________________/
+ // /
+ // checkpointed data ends here, and either the next checkpoint begins or the metadata in the current cluster ends
+ //
+ // CRC is calculated from byte sequence | data | checkpointed_data_length |
+ // E.g. if the data consists of bytes "abcd" and checkpointed_data_length of bytes "xyz", then the byte sequence
+ // would be "abcdxyz"
+ uint32_t crc32_code;
+ unit_size_t checkpointed_data_length;
+} __attribute__((packed));
+
+struct ondisk_next_metadata_cluster {
+ cluster_id_t cluster_id; // metadata log continues there
+} __attribute__((packed));
+
+template<typename T>
+constexpr size_t ondisk_entry_size(const T& entry) noexcept {
+ static_assert(std::is_same_v<T, ondisk_next_metadata_cluster>, "ondisk entry size not defined for given type");
+ return sizeof(ondisk_type) + sizeof(entry);
+}
+
+} // namespace seastar::fs
diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
new file mode 100644
index 00000000..c10852a3
--- /dev/null
+++ b/src/fs/metadata_log.hh
@@ -0,0 +1,295 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/cluster.hh"
+#include "fs/cluster_allocator.hh"
+#include "fs/inode.hh"
+#include "fs/inode_info.hh"
+#include "fs/metadata_disk_entries.hh"
+#include "fs/metadata_to_disk_buffer.hh"
+#include "fs/units.hh"
+#include "fs/unix_metadata.hh"
+#include "fs/value_shared_lock.hh"
+#include "seastar/core/file-types.hh"
+#include "seastar/core/future-util.hh"
+#include "seastar/core/future.hh"
+#include "seastar/core/shared_future.hh"
+#include "seastar/core/shared_ptr.hh"
+#include "seastar/core/temporary_buffer.hh"
+#include "seastar/fs/exceptions.hh"
+
+#include <chrono>
+#include <cstddef>
+#include <exception>
+#include <type_traits>
+#include <utility>
+#include <variant>
+
+namespace seastar::fs {
+
+class metadata_log {
+ block_device _device;
+ const unit_size_t _cluster_size;
+ const unit_size_t _alignment;
+
+ // Takes care of writing current cluster of serialized metadata log entries to device
+ shared_ptr<metadata_to_disk_buffer> _curr_cluster_buff;
+ shared_future<> _background_futures = now();
+
+ // In memory metadata
+ cluster_allocator _cluster_allocator;
+ std::map<inode_t, inode_info> _inodes;
+ inode_t _root_dir;
+ shard_inode_allocator _inode_allocator;
+
+ // Locks are used to ensure metadata consistency while allowing concurrent usage.
+ //
+ // Whenever one wants to create or delete an inode or directory entry, one has to acquire the appropriate unique
+ // lock for the inode / dir entry that will appear / disappear, and only after locking should that operation take
+ // place. Shared locks should be used only to ensure that an inode / dir entry won't disappear / appear while some
+ // action is performed. Therefore, unique locks ensure that the resource is not used by anyone else.
+ //
+ // IMPORTANT: if an operation needs to acquire more than one lock, it has to be done with *one* call to
+ // locks::with_locks() because it is ensured there that a deadlock-free locking order is used (for details see
+ // that function).
+ //
+ // Examples:
+ // - To create file we have to take shared lock (SL) on the directory to which we add a dir entry and
+ // unique lock (UL) on the added entry in this directory. SL is taken because the directory should not disappear.
+ // UL is taken, because we do not want the entry to appear while we are creating it.
+ // - To read or write to a file, a SL is acquired on its inode and then the operation is performed.
+ class locks {
+ value_shared_lock<inode_t> _inode_locks;
+ value_shared_lock<std::pair<inode_t, std::string>> _dir_entry_locks;
+
+ public:
+ struct shared {
+ inode_t inode;
+ std::optional<std::string> dir_entry;
+ };
+
+ template<class T>
+ static constexpr bool is_shared = std::is_same_v<std::remove_cv_t<std::remove_reference_t<T>>, shared>;
+
+ struct unique {
+ inode_t inode;
+ std::optional<std::string> dir_entry;
+ };
+
+ template<class T>
+ static constexpr bool is_unique = std::is_same_v<std::remove_cv_t<std::remove_reference_t<T>>, unique>;
+
+ template<class Kind, class Func>
+ auto with_lock(Kind kind, Func&& func) {
+ static_assert(is_shared<Kind> or is_unique<Kind>);
+ if constexpr (is_shared<Kind>) {
+ if (kind.dir_entry.has_value()) {
+ return _dir_entry_locks.with_shared_on({kind.inode, std::move(*kind.dir_entry)},
+ std::forward<Func>(func));
+ } else {
+ return _inode_locks.with_shared_on(kind.inode, std::forward<Func>(func));
+ }
+ } else {
+ if (kind.dir_entry.has_value()) {
+ return _dir_entry_locks.with_lock_on({kind.inode, std::move(*kind.dir_entry)},
+ std::forward<Func>(func));
+ } else {
+ return _inode_locks.with_lock_on(kind.inode, std::forward<Func>(func));
+ }
+ }
+ }
+
+ private:
+ template<class Kind1, class Kind2, class Func>
+ auto with_locks_in_order(Kind1 kind1, Kind2 kind2, Func func) {
+ // Func is not a universal reference because we will have to store it
+ return with_lock(std::move(kind1), [this, kind2 = std::move(kind2), func = std::move(func)] () mutable {
+ return with_lock(std::move(kind2), std::move(func));
+ });
+ };
+
+ public:
+
+ template<class Kind1, class Kind2, class Func>
+ auto with_locks(Kind1 kind1, Kind2 kind2, Func&& func) {
+ static_assert(is_shared<Kind1> or is_unique<Kind1>);
+ static_assert(is_shared<Kind2> or is_unique<Kind2>);
+
+ // Locking order is as follows: kind with lower tuple (inode, dir_entry) goes first.
+ // This order is linear and we always lock in one direction, so the graph of locking relations (A -> B iff
+ // lock on A is acquired and lock on B is acquired / being acquired) makes a DAG. Thus, deadlock is
+ // impossible, as it would require a cycle to appear.
+ std::pair<inode_t, std::optional<std::string>&> k1 {kind1.inode, kind1.dir_entry};
+ std::pair<inode_t, std::optional<std::string>&> k2 {kind2.inode, kind2.dir_entry};
+ if (k1 < k2) {
+ return with_locks_in_order(std::move(kind1), std::move(kind2), std::forward<Func>(func));
+ } else {
+ return with_locks_in_order(std::move(kind2), std::move(kind1), std::forward<Func>(func));
+ }
+ }
+ } _locks;
+
+ // TODO: for compaction: keep some set(?) of inode_data_vec, so that we can keep track of clusters that have lowest
+ // utilization (up-to-date data)
+ // TODO: for compaction: keep the estimated metadata log size (that it would take when written to disk) and
+ // the real size the metadata log takes on disk, to allow detecting when compaction is needed
+
+ friend class metadata_log_bootstrap;
+
+public:
+ metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment,
+ shared_ptr<metadata_to_disk_buffer> cluster_buff);
+
+ metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment);
+
+ metadata_log(const metadata_log&) = delete;
+ metadata_log& operator=(const metadata_log&) = delete;
+ metadata_log(metadata_log&&) = default;
+
+ future<> bootstrap(inode_t root_dir, cluster_id_t first_metadata_cluster_id, cluster_range available_clusters,
+ fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id);
+
+ future<> shutdown();
+
+private:
+ bool inode_exists(inode_t inode) const noexcept {
+ return _inodes.count(inode) != 0;
+ }
+
+ template<class Func>
+ void schedule_background_task(Func&& task) {
+ _background_futures = when_all_succeed(_background_futures.get_future(), std::forward<Func>(task));
+ }
+
+ void schedule_flush_of_curr_cluster();
+
+ enum class flush_result {
+ DONE,
+ NO_SPACE
+ };
+
+ [[nodiscard]] flush_result schedule_flush_of_curr_cluster_and_change_it_to_new_one();
+
+ future<> flush_curr_cluster();
+
+ enum class append_result {
+ APPENDED,
+ TOO_BIG,
+ NO_SPACE
+ };
+
+ template<class... Args>
+ [[nodiscard]] append_result append_ondisk_entry(Args&&... args) {
+ using AR = append_result;
+ // TODO: maybe check for errors on _background_futures to expose previous errors?
+ switch (_curr_cluster_buff->append(args...)) {
+ case metadata_to_disk_buffer::APPENDED:
+ return AR::APPENDED;
+ case metadata_to_disk_buffer::TOO_BIG:
+ break;
+ }
+
+ switch (schedule_flush_of_curr_cluster_and_change_it_to_new_one()) {
+ case flush_result::NO_SPACE:
+ return AR::NO_SPACE;
+ case flush_result::DONE:
+ break;
+ }
+
+ switch (_curr_cluster_buff->append(args...)) {
+ case metadata_to_disk_buffer::APPENDED:
+ return AR::APPENDED;
+ case metadata_to_disk_buffer::TOO_BIG:
+ return AR::TOO_BIG;
+ }
+
+ __builtin_unreachable();
+ }
+
+ enum class path_lookup_error {
+ NOT_ABSOLUTE, // a path is not absolute
+ NO_ENTRY, // no such file or directory
+ NOT_DIR, // a component used as a directory in path is not, in fact, a directory
+ };
+
+ std::variant<inode_t, path_lookup_error> do_path_lookup(const std::string& path) const noexcept;
+
+ // It is safe for @p path to be a temporary (there is no need to worry about its lifetime)
+ future<inode_t> path_lookup(const std::string& path) const;
+
+public:
+ template<class Func>
+ future<> iterate_directory(const std::string& dir_path, Func func) {
+ static_assert(std::is_invocable_r_v<future<>, Func, const std::string&> or
+ std::is_invocable_r_v<future<stop_iteration>, Func, const std::string&>);
+ auto convert_func = [&]() -> decltype(auto) {
+ if constexpr (std::is_invocable_r_v<future<stop_iteration>, Func, const std::string&>) {
+ return std::move(func);
+ } else {
+ return [func = std::move(func)]() -> future<stop_iteration> {
+ return func().then([] {
+ return stop_iteration::no;
+ });
+ };
+ }
+ };
+ return path_lookup(dir_path).then([this, func = convert_func()](inode_t dir_inode) {
+ return do_with(std::move(func), std::string {}, [this, dir_inode](auto& func, auto& prev_entry) {
+ auto it = _inodes.find(dir_inode);
+ if (it == _inodes.end()) {
+ return now(); // Directory disappeared
+ }
+ if (not it->second.is_directory()) {
+ return make_exception_future(path_component_not_directory_exception());
+ }
+
+ return repeat([this, dir_inode, &prev_entry, &func] {
+ auto it = _inodes.find(dir_inode);
+ if (it == _inodes.end()) {
+ return make_ready_future<stop_iteration>(stop_iteration::yes); // Directory disappeared
+ }
+ assert(it->second.is_directory() and "Directory cannot become a file");
+ auto& dir = it->second.get_directory();
+
+ auto entry_it = dir.entries.upper_bound(prev_entry);
+ if (entry_it == dir.entries.end()) {
+ return make_ready_future<stop_iteration>(stop_iteration::yes); // No more entries
+ }
+
+ prev_entry = entry_it->first;
+ return func(static_cast<const std::string&>(prev_entry));
+ });
+ });
+ });
+ }
+
+ // Returns the size of the file or throws an exception iff @p inode is invalid
+ file_offset_t file_size(inode_t inode) const;
+
+ // All disk-related errors will be exposed here
+ future<> flush_log() {
+ return flush_curr_cluster();
+ }
+};
+
+} // namespace seastar::fs
diff --git a/src/fs/metadata_log_bootstrap.hh b/src/fs/metadata_log_bootstrap.hh
new file mode 100644
index 00000000..5da79631
--- /dev/null
+++ b/src/fs/metadata_log_bootstrap.hh
@@ -0,0 +1,123 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/bitwise.hh"
+#include "fs/cluster.hh"
+#include "fs/inode.hh"
+#include "fs/inode_info.hh"
+#include "fs/metadata_disk_entries.hh"
+#include "fs/metadata_to_disk_buffer.hh"
+#include "fs/units.hh"
+#include "fs/metadata_log.hh"
+#include "seastar/core/do_with.hh"
+#include "seastar/core/future-util.hh"
+#include "seastar/core/future.hh"
+#include "seastar/core/temporary_buffer.hh"
+
+#include <boost/crc.hpp>
+#include <cstddef>
+#include <cstring>
+#include <unordered_set>
+#include <variant>
+
+namespace seastar::fs {
+
+// TODO: add a comment about what it is
+class data_reader {
+ const uint8_t* _data = nullptr;
+ size_t _size = 0;
+ size_t _pos = 0;
+ size_t _last_checkpointed_pos = 0;
+
+public:
+ data_reader() = default;
+
+ data_reader(const uint8_t* data, size_t size) : _data(data), _size(size) {}
+
+ size_t curr_pos() const noexcept { return _pos; }
+
+ size_t last_checkpointed_pos() const noexcept { return _last_checkpointed_pos; }
+
+ size_t bytes_left() const noexcept { return _size - _pos; }
+
+ void align_curr_pos(size_t alignment) noexcept { _pos = round_up_to_multiple_of_power_of_2(_pos, alignment); }
+
+ void checkpoint_curr_pos() noexcept { _last_checkpointed_pos = _pos; }
+
+ // Returns whether the reading was successful
+ bool read(void* destination, size_t size);
+
+ // Returns whether the reading was successful
+ template<class T>
+ bool read_entry(T& entry) noexcept {
+ return read(&entry, sizeof(entry));
+ }
+
+ // Returns whether the reading was successful
+ bool read_string(std::string& str, size_t size);
+
+ std::optional<temporary_buffer<uint8_t>> read_tmp_buff(size_t size);
+
+ // Returns whether the processing was successful
+ bool process_crc_without_reading(boost::crc_32_type& crc, size_t size);
+
+ std::optional<data_reader> extract(size_t size);
+};
+
+class metadata_log_bootstrap {
+ metadata_log& _metadata_log;
+ cluster_range _available_clusters;
+ std::unordered_set<cluster_id_t> _taken_clusters;
+ std::optional<cluster_id_t> _next_cluster;
+ temporary_buffer<uint8_t> _curr_cluster_data;
+ data_reader _curr_cluster;
+ data_reader _curr_checkpoint;
+
+ metadata_log_bootstrap(metadata_log& metadata_log, cluster_range available_clusters);
+
+ future<> bootstrap(cluster_id_t first_metadata_cluster_id, fs_shard_id_t fs_shards_pool_size,
+ fs_shard_id_t fs_shard_id);
+
+ future<> bootstrap_cluster(cluster_id_t curr_cluster);
+
+ static auto invalid_entry_exception() {
+ return make_exception_future<>(std::runtime_error("Invalid metadata log entry"));
+ }
+
+ future<> bootstrap_read_cluster();
+
+ // Returns whether reading and checking was successful
+ bool read_and_check_checkpoint();
+
+ future<> bootstrap_checkpointed_data();
+
+ future<> bootstrap_next_metadata_cluster();
+
+ bool inode_exists(inode_t inode);
+
+public:
+ static future<> bootstrap(metadata_log& metadata_log, inode_t root_dir, cluster_id_t first_metadata_cluster_id,
+ cluster_range available_clusters, fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id);
+};
+
+} // namespace seastar::fs
diff --git a/src/fs/metadata_to_disk_buffer.hh b/src/fs/metadata_to_disk_buffer.hh
new file mode 100644
index 00000000..bd60f4f3
--- /dev/null
+++ b/src/fs/metadata_to_disk_buffer.hh
@@ -0,0 +1,158 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/bitwise.hh"
+#include "fs/metadata_disk_entries.hh"
+#include "fs/to_disk_buffer.hh"
+
+#include <boost/crc.hpp>
+
+namespace seastar::fs {
+
+// Represents a buffer that will be written to a block_device. Method init() should be called just after
+// construction in order to finish initialization.
+class metadata_to_disk_buffer : protected to_disk_buffer {
+ boost::crc_32_type _crc;
+
+public:
+ metadata_to_disk_buffer() = default;
+
+ using to_disk_buffer::init; // Explicitly stated that stays the same
+
+ virtual shared_ptr<metadata_to_disk_buffer> virtual_constructor() const {
+ return make_shared<metadata_to_disk_buffer>();
+ }
+
+ /**
+     * @brief Initializes the object, leaving it in the state as if it had just been flushed, with the unflushed
+     * data end at @p metadata_end_pos (rounded up to @p alignment)
+ *
+ * @param aligned_max_size size of the buffer, must be aligned
+ * @param alignment write alignment
+ * @param cluster_beg_offset disk offset of the beginning of the cluster
+ * @param metadata_end_pos position at which valid metadata ends: valid metadata range: [0, @p metadata_end_pos)
+ */
+ virtual void init_from_bootstrapped_cluster(size_t aligned_max_size, unit_size_t alignment,
+ disk_offset_t cluster_beg_offset, size_t metadata_end_pos) {
+ assert(is_power_of_2(alignment));
+ assert(mod_by_power_of_2(aligned_max_size, alignment) == 0);
+ assert(mod_by_power_of_2(cluster_beg_offset, alignment) == 0);
+ assert(aligned_max_size >= sizeof(ondisk_type) + sizeof(ondisk_checkpoint));
+ assert(alignment >= sizeof(ondisk_type) + sizeof(ondisk_checkpoint) + sizeof(ondisk_type) +
+ sizeof(ondisk_next_metadata_cluster) and
+ "We always need to be able to pack at least a checkpoint and next_metadata_cluster entry to the last "
+ "data flush in the cluster");
+ assert(metadata_end_pos < aligned_max_size);
+
+ _max_size = aligned_max_size;
+ _alignment = alignment;
+ _cluster_beg_offset = cluster_beg_offset;
+ auto aligned_pos = round_up_to_multiple_of_power_of_2(metadata_end_pos, _alignment);
+ _unflushed_data = {aligned_pos, aligned_pos};
+ _buff = decltype(_buff)::aligned(_alignment, _max_size);
+
+ start_new_unflushed_data();
+ }
+
+protected:
+ void start_new_unflushed_data() noexcept override {
+ if (bytes_left() < sizeof(ondisk_type) + sizeof(ondisk_checkpoint) + sizeof(ondisk_type) +
+ sizeof(ondisk_next_metadata_cluster)) {
+ assert(bytes_left() == 0); // alignment has to be big enough to hold checkpoint and next_metadata_cluster
+ return; // No more space
+ }
+
+ ondisk_type type = INVALID;
+ ondisk_checkpoint checkpoint;
+ std::memset(&checkpoint, 0, sizeof(checkpoint));
+
+ to_disk_buffer::append_bytes(&type, sizeof(type));
+ to_disk_buffer::append_bytes(&checkpoint, sizeof(checkpoint));
+
+ _crc.reset();
+ }
+
+ void prepare_unflushed_data_for_flush() noexcept override {
+ // Make checkpoint valid
+ constexpr ondisk_type checkpoint_type = CHECKPOINT;
+ size_t checkpoint_pos = _unflushed_data.beg + sizeof(checkpoint_type);
+ ondisk_checkpoint checkpoint;
+ checkpoint.checkpointed_data_length = _unflushed_data.end - checkpoint_pos - sizeof(checkpoint);
+ _crc.process_bytes(&checkpoint.checkpointed_data_length, sizeof(checkpoint.checkpointed_data_length));
+ checkpoint.crc32_code = _crc.checksum();
+
+ std::memcpy(_buff.get_write() + _unflushed_data.beg, &checkpoint_type, sizeof(checkpoint_type));
+ std::memcpy(_buff.get_write() + checkpoint_pos, &checkpoint, sizeof(checkpoint));
+ }
+
+public:
+ using to_disk_buffer::bytes_left_after_flush_if_done_now; // Explicitly stated that stays the same
+
+private:
+ void append_bytes(const void* data, size_t len) noexcept override {
+ to_disk_buffer::append_bytes(data, len);
+ _crc.process_bytes(data, len);
+ }
+
+public:
+ enum append_result {
+ APPENDED,
+ TOO_BIG,
+ };
+
+ [[nodiscard]] virtual append_result append(const ondisk_next_metadata_cluster& next_metadata_cluster) noexcept {
+ ondisk_type type = NEXT_METADATA_CLUSTER;
+ if (bytes_left() < ondisk_entry_size(next_metadata_cluster)) {
+ return TOO_BIG;
+ }
+
+ append_bytes(&type, sizeof(type));
+ append_bytes(&next_metadata_cluster, sizeof(next_metadata_cluster));
+ return APPENDED;
+ }
+
+ using to_disk_buffer::bytes_left;
+
+protected:
+ bool fits_for_append(size_t bytes_no) const noexcept {
+ // We need to reserve space for the next metadata cluster entry
+ return (bytes_left() >= bytes_no + sizeof(ondisk_type) + sizeof(ondisk_next_metadata_cluster));
+ }
+
+private:
+ template<class T>
+ [[nodiscard]] append_result append_simple(ondisk_type type, const T& entry) noexcept {
+ if (not fits_for_append(ondisk_entry_size(entry))) {
+ return TOO_BIG;
+ }
+
+ append_bytes(&type, sizeof(type));
+ append_bytes(&entry, sizeof(entry));
+ return APPENDED;
+ }
+
+public:
+ using to_disk_buffer::flush_to_disk;
+};
+
+} // namespace seastar::fs
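The checkpoint framing above reserves room for an `ondisk_checkpoint` right after the type tag when a fragment starts, and fills in the length and CRC only in `prepare_unflushed_data_for_flush()`. A minimal sketch of that layout arithmetic (the struct fields and the 1-byte type tag mirror the patch, but are redeclared here as assumptions; the real CRC is computed with `boost::crc_32_type`):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed mirror of ondisk_checkpoint: length of the checkpointed data,
// followed by a CRC32 over that data and the length field itself.
struct ondisk_checkpoint_sketch {
    uint64_t checkpointed_data_length;
    uint32_t crc32_code;
} __attribute__((packed));

// On-disk layout of one unflushed fragment [beg, end):
// | type tag (1 B) | ondisk_checkpoint | checkpointed data ... |
// ^ beg                                ^ data start            ^ end
//
// Mirrors the length computation in prepare_unflushed_data_for_flush().
size_t checkpointed_data_length(size_t beg, size_t end) {
    const size_t type_tag_size = sizeof(uint8_t); // ondisk_type is uint8_t in the patch
    size_t checkpoint_pos = beg + type_tag_size;
    return end - checkpoint_pos - sizeof(ondisk_checkpoint_sketch);
}
```

Everything appended after the checkpoint is fed to the CRC as it is appended, so finalizing a fragment only has to process the length field and read out the checksum.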
diff --git a/src/fs/to_disk_buffer.hh b/src/fs/to_disk_buffer.hh
new file mode 100644
index 00000000..612f26d2
--- /dev/null
+++ b/src/fs/to_disk_buffer.hh
@@ -0,0 +1,138 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/bitwise.hh"
+#include "fs/units.hh"
+#include "seastar/core/future.hh"
+#include "seastar/core/temporary_buffer.hh"
+#include "seastar/fs/block_device.hh"
+
+#include <cstring>
+
+namespace seastar::fs {
+
+// Represents a buffer that will be written to a block_device. Method init() should be called just after
+// construction in order to finish initialization.
+class to_disk_buffer {
+protected:
+ temporary_buffer<uint8_t> _buff;
+ size_t _max_size = 0;
+ unit_size_t _alignment = 0;
+ disk_offset_t _cluster_beg_offset = 0; // disk offset that corresponds to _buff.begin()
+ range<size_t> _unflushed_data = {0, 0}; // range of unflushed bytes in _buff
+
+public:
+ to_disk_buffer() = default;
+
+ to_disk_buffer(const to_disk_buffer&) = delete;
+ to_disk_buffer& operator=(const to_disk_buffer&) = delete;
+ to_disk_buffer(to_disk_buffer&&) = default;
+ to_disk_buffer& operator=(to_disk_buffer&&) = default;
+
+ // Total number of bytes appended cannot exceed @p aligned_max_size.
+ // @p cluster_beg_offset is the disk offset of the beginning of the cluster.
+ virtual void init(size_t aligned_max_size, unit_size_t alignment, disk_offset_t cluster_beg_offset) {
+ assert(is_power_of_2(alignment));
+ assert(mod_by_power_of_2(aligned_max_size, alignment) == 0);
+ assert(mod_by_power_of_2(cluster_beg_offset, alignment) == 0);
+
+ _max_size = aligned_max_size;
+ _alignment = alignment;
+ _cluster_beg_offset = cluster_beg_offset;
+ _unflushed_data = {0, 0};
+ _buff = decltype(_buff)::aligned(_alignment, _max_size);
+ start_new_unflushed_data();
+ }
+
+ virtual ~to_disk_buffer() = default;
+
+ /**
+     * @brief Writes the buffered (unflushed) data to disk and, if there is enough space, starts a new unflushed-data
+     * fragment
+     * IMPORTANT: using this buffer before the call to flush_to_disk() completes is perfectly OK
+     * @details After each flush we align the offset at which the new unflushed data continues. This is very
+     * important, as it ensures that consecutive flushes, and their underlying write operations to the block device,
+     * do not overlap. If the writes overlapped, they could be performed in reverse order, corrupting the on-disk data.
+ *
+ * @param device output device
+ */
+ virtual future<> flush_to_disk(block_device device) {
+ prepare_unflushed_data_for_flush();
+ // Data layout overview:
+ // |.........................|00000000000000000000000|
+ // ^ _unflushed_data.beg ^ _unflushed_data.end ^ real_write.end
+ // (aligned) (maybe unaligned) (aligned)
+ // == real_write.beg == new _unflushed_data.beg
+ // |<------ padding ------>|
+ assert(mod_by_power_of_2(_unflushed_data.beg, _alignment) == 0);
+ range real_write = {
+ _unflushed_data.beg,
+ round_up_to_multiple_of_power_of_2(_unflushed_data.end, _alignment),
+ };
+ // Pad buffer with zeros till alignment
+ range padding = {_unflushed_data.end, real_write.end};
+ std::memset(_buff.get_write() + padding.beg, 0, padding.size());
+
+ // Make sure the buffer is usable before returning from this function
+ _unflushed_data = {real_write.end, real_write.end};
+ start_new_unflushed_data();
+
+ return device.write(_cluster_beg_offset + real_write.beg, _buff.get_write() + real_write.beg, real_write.size())
+ .then([real_write](size_t written_bytes) {
+ if (written_bytes != real_write.size()) {
+ return make_exception_future<>(std::runtime_error("Partial write"));
+ // TODO: maybe add some way to retry write, because once the buffer is corrupt nothing can be done now
+ }
+
+ return now();
+ });
+ }
+
+protected:
+    // May be called before the flushing of the previous fragment is done
+ virtual void start_new_unflushed_data() noexcept {}
+
+ virtual void prepare_unflushed_data_for_flush() noexcept {}
+
+public:
+ virtual void append_bytes(const void* data, size_t len) noexcept {
+ assert(len <= bytes_left());
+ std::memcpy(_buff.get_write() + _unflushed_data.end, data, len);
+ _unflushed_data.end += len;
+ }
+
+    // Returns the maximum number of bytes that may still be appended to the buffer (space is reclaimed only by init())
+ virtual size_t bytes_left() const noexcept { return _max_size - _unflushed_data.end; }
+
+ virtual size_t bytes_left_after_flush_if_done_now() const noexcept {
+ return _max_size - round_up_to_multiple_of_power_of_2(_unflushed_data.end, _alignment);
+ }
+
+ // Returns disk offset of the place where the first byte of next appended bytes would be after flush
+ // TODO: maybe better name for that function? Or any other method to extract that data?
+ virtual disk_offset_t current_disk_offset() const noexcept {
+ return _cluster_beg_offset + _unflushed_data.end;
+ }
+};
+
+} // namespace seastar::fs
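`flush_to_disk()` relies on power-of-2 alignment arithmetic from `fs/bitwise.hh`. A standalone sketch of the two helpers it uses (reimplemented here as an assumption, valid only when the alignment is a power of 2, which the asserts in `init()` require):

```cpp
#include <cassert>
#include <cstddef>

// Round x up to the nearest multiple of a power-of-2 alignment,
// as flush_to_disk() does to compute real_write.end.
constexpr size_t round_up_to_multiple_of_power_of_2(size_t x, size_t alignment) {
    return (x + alignment - 1) & ~(alignment - 1);
}

// x % alignment, valid only when alignment is a power of 2.
constexpr size_t mod_by_power_of_2(size_t x, size_t alignment) {
    return x & (alignment - 1);
}
```

This rounding is what keeps consecutive flushes from overlapping: each flush's write range starts at the previous range's rounded-up end.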
diff --git a/src/fs/unix_metadata.hh b/src/fs/unix_metadata.hh
new file mode 100644
index 00000000..6f634044
--- /dev/null
+++ b/src/fs/unix_metadata.hh
@@ -0,0 +1,40 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2019 ScyllaDB
+ */
+
+#pragma once
+
+#include "seastar/core/file-types.hh"
+
+#include <cstdint>
+#include <sys/types.h>
+
+namespace seastar::fs {
+
+struct unix_metadata {
+ file_permissions perms;
+ uid_t uid;
+ gid_t gid;
+ uint64_t btime_ns;
+ uint64_t mtime_ns;
+ uint64_t ctime_ns;
+};
+
+} // namespace seastar::fs
diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc
new file mode 100644
index 00000000..6e29f2e5
--- /dev/null
+++ b/src/fs/metadata_log.cc
@@ -0,0 +1,222 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#include "fs/cluster.hh"
+#include "fs/cluster_allocator.hh"
+#include "fs/inode.hh"
+#include "fs/inode_info.hh"
+#include "fs/metadata_disk_entries.hh"
+#include "fs/metadata_log.hh"
+#include "fs/metadata_log_bootstrap.hh"
+#include "fs/metadata_to_disk_buffer.hh"
+#include "fs/path.hh"
+#include "fs/units.hh"
+#include "fs/unix_metadata.hh"
+#include "seastar/core/aligned_buffer.hh"
+#include "seastar/core/do_with.hh"
+#include "seastar/core/file-types.hh"
+#include "seastar/core/future-util.hh"
+#include "seastar/core/future.hh"
+#include "seastar/core/shared_mutex.hh"
+#include "seastar/fs/overloaded.hh"
+
+#include <boost/crc.hpp>
+#include <boost/range/irange.hpp>
+#include <chrono>
+#include <cstddef>
+#include <cstdint>
+#include <cstring>
+#include <limits>
+#include <stdexcept>
+#include <string_view>
+#include <unordered_set>
+#include <variant>
+
+namespace seastar::fs {
+
+metadata_log::metadata_log(block_device device, uint32_t cluster_size, uint32_t alignment,
+ shared_ptr<metadata_to_disk_buffer> cluster_buff)
+: _device(std::move(device))
+, _cluster_size(cluster_size)
+, _alignment(alignment)
+, _curr_cluster_buff(std::move(cluster_buff))
+, _cluster_allocator({}, {})
+, _inode_allocator(1, 0) {
+ assert(is_power_of_2(alignment));
+ assert(cluster_size > 0 and cluster_size % alignment == 0);
+}
+
+metadata_log::metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment)
+: metadata_log(std::move(device), cluster_size, alignment,
+ make_shared<metadata_to_disk_buffer>()) {}
+
+future<> metadata_log::bootstrap(inode_t root_dir, cluster_id_t first_metadata_cluster_id, cluster_range available_clusters,
+ fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id) {
+ return metadata_log_bootstrap::bootstrap(*this, root_dir, first_metadata_cluster_id, available_clusters,
+ fs_shards_pool_size, fs_shard_id);
+}
+
+future<> metadata_log::shutdown() {
+ return flush_log().then([this] {
+ return _device.close();
+ });
+}
+
+void metadata_log::schedule_flush_of_curr_cluster() {
+ // Make writes concurrent (TODO: maybe serialized within *one* cluster would be faster?)
+ schedule_background_task(do_with(_curr_cluster_buff, &_device, [](auto& crr_clstr_bf, auto& device) {
+ return crr_clstr_bf->flush_to_disk(*device);
+ }));
+}
+
+future<> metadata_log::flush_curr_cluster() {
+ if (_curr_cluster_buff->bytes_left_after_flush_if_done_now() == 0) {
+ switch (schedule_flush_of_curr_cluster_and_change_it_to_new_one()) {
+ case flush_result::NO_SPACE:
+ return make_exception_future(no_more_space_exception());
+ case flush_result::DONE:
+ break;
+ }
+ } else {
+ schedule_flush_of_curr_cluster();
+ }
+
+ return _background_futures.get_future();
+}
+
+metadata_log::flush_result metadata_log::schedule_flush_of_curr_cluster_and_change_it_to_new_one() {
+ auto next_cluster = _cluster_allocator.alloc();
+ if (not next_cluster) {
+        // Here the metadata log dies: we cannot even flush the current cluster, because we would not be able to
+        // recover from it afterwards
+        // TODO: ^ add protection from it and take it into account during compaction
+ return flush_result::NO_SPACE;
+ }
+
+ auto append_res = _curr_cluster_buff->append(ondisk_next_metadata_cluster {*next_cluster});
+ assert(append_res == metadata_to_disk_buffer::APPENDED);
+ schedule_flush_of_curr_cluster();
+
+ // Make next cluster the current cluster to allow writing next metadata entries before flushing finishes
+    _curr_cluster_buff = _curr_cluster_buff->virtual_constructor();
+ _curr_cluster_buff->init(_cluster_size, _alignment,
+ cluster_id_to_offset(*next_cluster, _cluster_size));
+ return flush_result::DONE;
+}
+
+std::variant<inode_t, metadata_log::path_lookup_error> metadata_log::do_path_lookup(const std::string& path) const noexcept {
+ if (path.empty() or path[0] != '/') {
+ return path_lookup_error::NOT_ABSOLUTE;
+ }
+
+ std::vector<inode_t> components_stack = {_root_dir};
+ size_t beg = 0;
+ while (beg < path.size()) {
+ range component_range = {beg, path.find('/', beg)};
+ bool check_if_dir = false;
+ if (component_range.end == path.npos) {
+ component_range.end = path.size();
+ beg = path.size();
+ } else {
+ check_if_dir = true;
+ beg = component_range.end + 1; // Jump over '/'
+ }
+
+ std::string_view component(path.data() + component_range.beg, component_range.size());
+ // Process the component
+ if (component == "") {
+ continue;
+ } else if (component == ".") {
+ assert(component_range.beg > 0 and path[component_range.beg - 1] == '/' and "Since path is absolute we do not have to check if the current component is a directory");
+ continue;
+ } else if (component == "..") {
+ if (components_stack.size() > 1) { // Root dir cannot be popped
+ components_stack.pop_back();
+ }
+ } else {
+ auto dir_it = _inodes.find(components_stack.back());
+ assert(dir_it != _inodes.end() and "inode comes from some previous lookup (or is a root directory) hence dir_it has to be valid");
+ assert(dir_it->second.is_directory() and "every previous component is a directory and it was checked when they were processed");
+ auto& curr_dir = dir_it->second.get_directory();
+
+ auto it = curr_dir.entries.find(component);
+ if (it == curr_dir.entries.end()) {
+ return path_lookup_error::NO_ENTRY;
+ }
+
+ inode_t entry_inode = it->second;
+ if (check_if_dir) {
+ auto entry_it = _inodes.find(entry_inode);
+ assert(entry_it != _inodes.end() and "dir entries have to exist");
+ if (not entry_it->second.is_directory()) {
+ return path_lookup_error::NOT_DIR;
+ }
+ }
+
+ components_stack.emplace_back(entry_inode);
+ }
+ }
+
+ return components_stack.back();
+}
+
+future<inode_t> metadata_log::path_lookup(const std::string& path) const {
+ auto lookup_res = do_path_lookup(path);
+ return std::visit(overloaded {
+ [](path_lookup_error error) {
+ switch (error) {
+ case path_lookup_error::NOT_ABSOLUTE:
+ return make_exception_future<inode_t>(path_is_not_absolute_exception());
+ case path_lookup_error::NO_ENTRY:
+ return make_exception_future<inode_t>(no_such_file_or_directory_exception());
+ case path_lookup_error::NOT_DIR:
+ return make_exception_future<inode_t>(path_component_not_directory_exception());
+ }
+ __builtin_unreachable();
+ },
+ [](inode_t inode) {
+ return make_ready_future<inode_t>(inode);
+ }
+ }, lookup_res);
+}
+
+file_offset_t metadata_log::file_size(inode_t inode) const {
+ auto it = _inodes.find(inode);
+ if (it == _inodes.end()) {
+ throw invalid_inode_exception();
+ }
+
+ return std::visit(overloaded {
+ [](const inode_info::file& file) {
+ return file.size();
+ },
+ [](const inode_info::directory&) -> file_offset_t {
+ throw invalid_inode_exception();
+ }
+ }, it->second.contents);
+}
+
+// TODO: think about how to make the filesystem recoverable from an ENOSPACE situation: flush() (or something else)
+// throws ENOSPACE, then it should be possible to compact some data (e.g. by truncating a file) via the top-level
+// interface and to retry the flush() without an ENOSPACE error. In particular, deleting all files after ENOSPACE
+// should succeed. It becomes especially hard if we write metadata to the last cluster and there is not enough room
+// to write these delete operations. We have to guarantee that the filesystem is in a recoverable state then.
+
+} // namespace seastar::fs
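`do_path_lookup()` walks an absolute path component by component: empty components and `.` are skipped, `..` pops the component stack (the root cannot be popped), and anything else descends into a directory. Setting the inode and directory-existence checks aside, the component walk can be sketched as:

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <vector>

// Splits an absolute path the way do_path_lookup() walks it and returns
// the resolved component names. Names here are illustrative, not the
// patch's API: the real function resolves inodes, not strings.
std::vector<std::string> resolve_components(std::string_view path) {
    std::vector<std::string> stack;
    size_t beg = 0;
    while (beg < path.size()) {
        size_t end = path.find('/', beg);
        if (end == std::string_view::npos) {
            end = path.size();
        }
        std::string_view component = path.substr(beg, end - beg);
        beg = end + 1; // Jump over '/'
        if (component.empty() || component == ".") {
            continue; // Nothing to do
        }
        if (component == "..") {
            if (!stack.empty()) { // Root cannot be popped
                stack.pop_back();
            }
        } else {
            stack.emplace_back(component);
        }
    }
    return stack;
}
```

The real implementation additionally verifies, for every non-final component, that the entry exists and is a directory, returning `NO_ENTRY` or `NOT_DIR` otherwise.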
diff --git a/src/fs/metadata_log_bootstrap.cc b/src/fs/metadata_log_bootstrap.cc
new file mode 100644
index 00000000..926d79fe
--- /dev/null
+++ b/src/fs/metadata_log_bootstrap.cc
@@ -0,0 +1,264 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#include "fs/bitwise.hh"
+#include "fs/metadata_disk_entries.hh"
+#include "fs/metadata_log_bootstrap.hh"
+#include "seastar/util/log.hh"
+
+namespace seastar::fs {
+
+namespace {
+logger mlogger("fs_metadata_bootstrap");
+} // namespace
+
+bool data_reader::read(void* destination, size_t size) {
+ if (_pos + size > _size) {
+ return false;
+ }
+
+ std::memcpy(destination, _data + _pos, size);
+ _pos += size;
+ return true;
+}
+
+bool data_reader::read_string(std::string& str, size_t size) {
+ str.resize(size);
+ return read(str.data(), size);
+}
+
+std::optional<temporary_buffer<uint8_t>> data_reader::read_tmp_buff(size_t size) {
+ if (_pos + size > _size) {
+ return std::nullopt;
+ }
+
+ _pos += size;
+ return temporary_buffer<uint8_t>(_data + _pos - size, size);
+}
+
+bool data_reader::process_crc_without_reading(boost::crc_32_type& crc, size_t size) {
+ if (_pos + size > _size) {
+ return false;
+ }
+
+ crc.process_bytes(_data + _pos, size);
+ return true;
+}
+
+std::optional<data_reader> data_reader::extract(size_t size) {
+ if (_pos + size > _size) {
+ return std::nullopt;
+ }
+
+ _pos += size;
+ return data_reader(_data + _pos - size, size);
+}
+
+metadata_log_bootstrap::metadata_log_bootstrap(metadata_log& metadata_log, cluster_range available_clusters)
+: _metadata_log(metadata_log)
+, _available_clusters(available_clusters)
+, _curr_cluster_data(decltype(_curr_cluster_data)::aligned(metadata_log._alignment, metadata_log._cluster_size))
+{}
+
+future<> metadata_log_bootstrap::bootstrap(cluster_id_t first_metadata_cluster_id, fs_shard_id_t fs_shards_pool_size,
+ fs_shard_id_t fs_shard_id) {
+ _next_cluster = first_metadata_cluster_id;
+    mlogger.debug(">>>> Started bootstrapping <<<<");
+ return do_with((cluster_id_t)first_metadata_cluster_id, [this](cluster_id_t& last_cluster) {
+ return do_until([this] { return not _next_cluster.has_value(); }, [this, &last_cluster] {
+ cluster_id_t curr_cluster = *_next_cluster;
+ _next_cluster = std::nullopt;
+ bool inserted = _taken_clusters.emplace(curr_cluster).second;
+ assert(inserted); // TODO: check it in next_cluster record
+ last_cluster = curr_cluster;
+ return bootstrap_cluster(curr_cluster);
+ }).then([this, &last_cluster] {
+            mlogger.debug("Data bootstrapping is done");
+ // Initialize _curr_cluster_buff
+ _metadata_log._curr_cluster_buff = _metadata_log._curr_cluster_buff->virtual_constructor();
+ mlogger.debug("Initializing _curr_cluster_buff: cluster {}, pos {}", last_cluster, _curr_cluster.last_checkpointed_pos());
+ _metadata_log._curr_cluster_buff->init_from_bootstrapped_cluster(_metadata_log._cluster_size,
+ _metadata_log._alignment, cluster_id_to_offset(last_cluster, _metadata_log._cluster_size),
+ _curr_cluster.last_checkpointed_pos());
+ });
+ }).then([this, fs_shards_pool_size, fs_shard_id] {
+        // Initialize _cluster_allocator
+ mlogger.debug("Initializing cluster allocator");
+ std::deque<cluster_id_t> free_clusters;
+ for (auto cid : boost::irange(_available_clusters.beg, _available_clusters.end)) {
+ if (_taken_clusters.count(cid) == 0) {
+ free_clusters.emplace_back(cid);
+ }
+ }
+ if (free_clusters.empty()) {
+ return make_exception_future(no_more_space_exception());
+ }
+ free_clusters.pop_front();
+
+ mlogger.debug("free clusters: {}", free_clusters.size());
+ _metadata_log._cluster_allocator = cluster_allocator(std::move(_taken_clusters), std::move(free_clusters));
+
+ // Reset _inode_allocator
+ std::optional<inode_t> max_inode_no;
+ if (not _metadata_log._inodes.empty()) {
+            max_inode_no = _metadata_log._inodes.rbegin()->first;
+ }
+ _metadata_log._inode_allocator = shard_inode_allocator(fs_shards_pool_size, fs_shard_id, max_inode_no);
+
+ // TODO: what about orphaned inodes: maybe they are remnants of unlinked files and we need to delete them,
+ // or maybe not?
+ return now();
+ });
+}
+
+future<> metadata_log_bootstrap::bootstrap_cluster(cluster_id_t curr_cluster) {
+ disk_offset_t curr_cluster_disk_offset = cluster_id_to_offset(curr_cluster, _metadata_log._cluster_size);
+    mlogger.debug("Bootstrapping from cluster {}...", curr_cluster);
+ return _metadata_log._device.read(curr_cluster_disk_offset, _curr_cluster_data.get_write(),
+ _metadata_log._cluster_size).then([this, curr_cluster](size_t bytes_read) {
+ if (bytes_read != _metadata_log._cluster_size) {
+ return make_exception_future(std::runtime_error("Failed to read whole cluster of the metadata log"));
+ }
+
+ mlogger.debug("Read cluster {}", curr_cluster);
+ _curr_cluster = data_reader(_curr_cluster_data.get(), _metadata_log._cluster_size);
+ return bootstrap_read_cluster();
+ });
+}
+
+future<> metadata_log_bootstrap::bootstrap_read_cluster() {
+ // Process cluster: the data layout format is:
+ // | checkpoint1 | data1... | checkpoint2 | data2... | ... |
+ return do_with(false, [this](bool& whole_log_ended) {
+ return do_until([this, &whole_log_ended] { return whole_log_ended or _next_cluster.has_value(); },
+ [this, &whole_log_ended] {
+ _curr_cluster.align_curr_pos(_metadata_log._alignment);
+ _curr_cluster.checkpoint_curr_pos();
+
+ if (not read_and_check_checkpoint()) {
+ mlogger.debug("Checkpoint invalid");
+ whole_log_ended = true;
+ return now();
+ }
+
+ mlogger.debug("Checkpoint correct");
+ return bootstrap_checkpointed_data();
+ }).then([] {
+ mlogger.debug("Cluster ended");
+ });
+ });
+}
+
+bool metadata_log_bootstrap::read_and_check_checkpoint() {
+ mlogger.debug("Processing checkpoint at {}", _curr_cluster.curr_pos());
+ ondisk_type entry_type;
+ ondisk_checkpoint checkpoint;
+ if (not _curr_cluster.read_entry(entry_type)) {
+ mlogger.debug("Cannot read entry type");
+ return false;
+ }
+ if (entry_type != CHECKPOINT) {
+ mlogger.debug("Entry type (= {}) is not CHECKPOINT (= {})", entry_type, CHECKPOINT);
+ return false;
+ }
+ if (not _curr_cluster.read_entry(checkpoint)) {
+ mlogger.debug("Cannot read checkpoint entry");
+ return false;
+ }
+
+ boost::crc_32_type crc;
+ if (not _curr_cluster.process_crc_without_reading(crc, checkpoint.checkpointed_data_length)) {
+ mlogger.debug("Invalid checkpoint's data length: {}", (unit_size_t)checkpoint.checkpointed_data_length);
+ return false;
+ }
+ crc.process_bytes(&checkpoint.checkpointed_data_length, sizeof(checkpoint.checkpointed_data_length));
+ if (crc.checksum() != checkpoint.crc32_code) {
+ mlogger.debug("CRC code does not match: computed = {}, read = {}", crc.checksum(), (uint32_t)checkpoint.crc32_code);
+ return false;
+ }
+
+ auto opt = _curr_cluster.extract(checkpoint.checkpointed_data_length);
+ assert(opt.has_value());
+ _curr_checkpoint = *opt;
+ return true;
+}
+
+future<> metadata_log_bootstrap::bootstrap_checkpointed_data() {
+ return do_with(ondisk_type {}, [this](ondisk_type& entry_type) {
+ return do_until([this, &entry_type] { return not _curr_checkpoint.read_entry(entry_type); },
+ [this, &entry_type] {
+ switch (entry_type) {
+ case INVALID:
+ case CHECKPOINT: // CHECKPOINT cannot appear as part of checkpointed data
+ return invalid_entry_exception();
+ case NEXT_METADATA_CLUSTER:
+ return bootstrap_next_metadata_cluster();
+ }
+
+ // unknown type => metadata log corruption
+ return invalid_entry_exception();
+ }).then([this] {
+ if (_curr_checkpoint.bytes_left() > 0) {
+ return invalid_entry_exception(); // Corrupted checkpointed data
+ }
+ return now();
+ });
+ });
+}
+
+future<> metadata_log_bootstrap::bootstrap_next_metadata_cluster() {
+ ondisk_next_metadata_cluster entry;
+ if (not _curr_checkpoint.read_entry(entry)) {
+ return invalid_entry_exception();
+ }
+
+ if (_next_cluster.has_value()) {
+ return invalid_entry_exception(); // Only one NEXT_METADATA_CLUSTER may appear in one cluster
+ }
+
+ _next_cluster = (cluster_id_t)entry.cluster_id;
+ return now();
+}
+
+bool metadata_log_bootstrap::inode_exists(inode_t inode) {
+ return _metadata_log._inodes.count(inode) != 0;
+}
+
+future<> metadata_log_bootstrap::bootstrap(metadata_log& metadata_log, inode_t root_dir, cluster_id_t first_metadata_cluster_id,
+ cluster_range available_clusters, fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id) {
+ // Clear the metadata log
+ metadata_log._inodes.clear();
+ metadata_log._background_futures = now();
+ metadata_log._root_dir = root_dir;
+ metadata_log._inodes.emplace(root_dir, inode_info {
+ 0,
+ 0,
+ {}, // TODO: change it to something meaningful
+ inode_info::directory {}
+ });
+
+ return do_with(metadata_log_bootstrap(metadata_log, available_clusters),
+ [first_metadata_cluster_id, fs_shards_pool_size, fs_shard_id](metadata_log_bootstrap& bootstrap) {
+ return bootstrap.bootstrap(first_metadata_cluster_id, fs_shards_pool_size, fs_shard_id);
+ });
+}
+
+} // namespace seastar::fs
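The bootstrap parser never trusts on-disk sizes: every read goes through `data_reader`, which refuses to move its cursor past the end of the buffer. A minimal standalone sketch of that bounds-checked cursor (the class name here is an assumption; the real `data_reader` additionally offers `extract()`, CRC processing, and position checkpointing):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Bounds-checked cursor over a raw byte buffer, in the spirit of
// data_reader from metadata_log_bootstrap.
class bounded_reader {
    const uint8_t* _data;
    size_t _size;
    size_t _pos = 0;
public:
    bounded_reader(const uint8_t* data, size_t size) : _data(data), _size(size) {}

    // Copies size bytes out; refuses (and does not advance) past the end.
    bool read(void* destination, size_t size) {
        if (_pos + size > _size) {
            return false;
        }
        std::memcpy(destination, _data + _pos, size);
        _pos += size;
        return true;
    }

    // Reads one fixed-size on-disk entry.
    template<class T>
    bool read_entry(T& entry) {
        return read(&entry, sizeof(entry));
    }

    size_t bytes_left() const { return _size - _pos; }
};
```

A failed read leaves the cursor in place, which is what lets `bootstrap_checkpointed_data()` treat "cannot read the next entry type" as the clean end of a checkpointed block.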
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 8a59eca6..19666a8a 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -658,6 +658,7 @@ if (Seastar_EXPERIMENTAL_FS)
PRIVATE
# SeastarFS source files
include/seastar/fs/block_device.hh
+ include/seastar/fs/exceptions.hh
include/seastar/fs/file.hh
include/seastar/fs/overloaded.hh
include/seastar/fs/temporary_file.hh
@@ -670,9 +671,18 @@ if (Seastar_EXPERIMENTAL_FS)
src/fs/crc.hh
src/fs/file.cc
src/fs/inode.hh
+ src/fs/inode_info.hh
+ src/fs/metadata_disk_entries.hh
+ src/fs/metadata_log.cc
+ src/fs/metadata_log.hh
+ src/fs/metadata_log_bootstrap.cc
+ src/fs/metadata_log_bootstrap.hh
+ src/fs/metadata_to_disk_buffer.hh
src/fs/path.hh
src/fs/range.hh
+ src/fs/to_disk_buffer.hh
src/fs/units.hh
+ src/fs/unix_metadata.hh

Krzysztof Małysa
Apr 20, 2020, 8:02:33 AM
to seast...@googlegroups.com, Krzysztof Małysa, sa...@scylladb.com, ank...@gmail.com, qup...@gmail.com, wmi...@protonmail.com
Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
src/fs/metadata_disk_entries.hh | 11 ++
src/fs/metadata_log.hh | 8 +
src/fs/metadata_log_bootstrap.hh | 2 +
src/fs/metadata_log_operations/create_file.hh | 174 ++++++++++++++++++
src/fs/metadata_to_disk_buffer.hh | 13 ++
src/fs/metadata_log.cc | 24 +++
src/fs/metadata_log_bootstrap.cc | 30 +++
CMakeLists.txt | 1 +
8 files changed, 263 insertions(+)
create mode 100644 src/fs/metadata_log_operations/create_file.hh

diff --git a/src/fs/metadata_disk_entries.hh b/src/fs/metadata_disk_entries.hh
index 437c2c2b..9c44b8cc 100644
--- a/src/fs/metadata_disk_entries.hh
+++ b/src/fs/metadata_disk_entries.hh
@@ -73,6 +73,7 @@ enum ondisk_type : uint8_t {
CHECKPOINT,
NEXT_METADATA_CLUSTER,
CREATE_INODE,
+ CREATE_INODE_AS_DIR_ENTRY,
};

struct ondisk_checkpoint {
@@ -102,11 +103,21 @@ struct ondisk_create_inode {
ondisk_unix_metadata metadata;
} __attribute__((packed));

+struct ondisk_create_inode_as_dir_entry_header {
+ ondisk_create_inode entry_inode;
+ inode_t dir_inode;
+ uint16_t entry_name_length;
+ // After header comes entry name
+} __attribute__((packed));
+
template<typename T>
constexpr size_t ondisk_entry_size(const T& entry) noexcept {
static_assert(std::is_same_v<T, ondisk_next_metadata_cluster> or
std::is_same_v<T, ondisk_create_inode>, "ondisk entry size not defined for given type");
return sizeof(ondisk_type) + sizeof(entry);
}
+constexpr size_t ondisk_entry_size(const ondisk_create_inode_as_dir_entry_header& entry) noexcept {
+ return sizeof(ondisk_type) + sizeof(entry) + entry.entry_name_length;
+}

} // namespace seastar::fs
diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
index 6f069c13..cc11b865 100644
--- a/src/fs/metadata_log.hh
+++ b/src/fs/metadata_log.hh
@@ -157,6 +157,7 @@ class metadata_log {
friend class metadata_log_bootstrap;

friend class create_and_open_unlinked_file_operation;
+ friend class create_file_operation;

public:
metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment,
@@ -179,6 +180,7 @@ class metadata_log {
}

inode_info& memory_only_create_inode(inode_t inode, bool is_directory, unix_metadata metadata);
+ void memory_only_add_dir_entry(inode_info::directory& dir, inode_t entry_inode, std::string entry_name);

template<class Func>
void schedule_background_task(Func&& task) {
@@ -290,8 +292,14 @@ class metadata_log {
// Returns size of the file or throws exception iff @p inode is invalid
file_offset_t file_size(inode_t inode) const;

+ future<> create_file(std::string path, file_permissions perms);
+
+ future<inode_t> create_and_open_file(std::string path, file_permissions perms);
+
future<inode_t> create_and_open_unlinked_file(file_permissions perms);

+ future<> create_directory(std::string path, file_permissions perms);
+
// All disk-related errors will be exposed here
future<> flush_log() {
return flush_curr_cluster();
diff --git a/src/fs/metadata_log_bootstrap.hh b/src/fs/metadata_log_bootstrap.hh
index 4a1fa7e9..d44c2f96 100644
--- a/src/fs/metadata_log_bootstrap.hh
+++ b/src/fs/metadata_log_bootstrap.hh
@@ -117,6 +117,8 @@ class metadata_log_bootstrap {

future<> bootstrap_create_inode();

+ future<> bootstrap_create_inode_as_dir_entry();
+
public:
static future<> bootstrap(metadata_log& metadata_log, inode_t root_dir, cluster_id_t first_metadata_cluster_id,
cluster_range available_clusters, fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id);
diff --git a/src/fs/metadata_log_operations/create_file.hh b/src/fs/metadata_log_operations/create_file.hh
new file mode 100644
index 00000000..3ba83226
--- /dev/null
+++ b/src/fs/metadata_log_operations/create_file.hh
@@ -0,0 +1,174 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/metadata_log.hh"
+#include "fs/path.hh"
+#include "seastar/core/future.hh"
+
+namespace seastar::fs {
+
+enum class create_semantics {
+ CREATE_FILE,
+ CREATE_AND_OPEN_FILE,
+ CREATE_DIR,
+};
+
+class create_file_operation {
+ metadata_log& _metadata_log;
+ create_semantics _create_semantics;
+ std::string _entry_name;
+ file_permissions _perms;
+ inode_t _dir_inode;
+ inode_info::directory* _dir_info;
+
+ create_file_operation(metadata_log& metadata_log) : _metadata_log(metadata_log) {}
+
+ future<inode_t> create_file(std::string path, file_permissions perms, create_semantics create_semantics) {
+ _create_semantics = create_semantics;
+ switch (create_semantics) {
+ case create_semantics::CREATE_FILE:
+ case create_semantics::CREATE_AND_OPEN_FILE:
+ break;
+ case create_semantics::CREATE_DIR:
+ while (not path.empty() and path.back() == '/') {
+ path.pop_back();
+ }
+ }
+
+ _entry_name = extract_last_component(path);
+ if (_entry_name.empty()) {
+ return make_exception_future<inode_t>(invalid_path_exception());
+ }
+ assert(path.empty() or path.back() == '/'); // Hence fast-checking for "is directory" is done in path_lookup
+
+ _perms = perms;
+ return _metadata_log.path_lookup(path).then([this](inode_t dir_inode) {
+ _dir_inode = dir_inode;
+ // Fail-fast checks before locking (as locking may be expensive)
+ auto dir_it = _metadata_log._inodes.find(_dir_inode);
+ if (dir_it == _metadata_log._inodes.end()) {
+ return make_exception_future<inode_t>(operation_became_invalid_exception());
+ }
+ assert(dir_it->second.is_directory() and "Directory cannot become file or there is a BUG in path_lookup");
+ _dir_info = &dir_it->second.get_directory();
+
+ if (_dir_info->entries.count(_entry_name) != 0) {
+ return make_exception_future<inode_t>(file_already_exists_exception());
+ }
+
+ return _metadata_log._locks.with_locks(metadata_log::locks::shared {dir_inode},
+ metadata_log::locks::unique {dir_inode, _entry_name}, [this] {
+ return create_file_in_directory();
+ });
+ });
+ }
+
+ future<inode_t> create_file_in_directory() {
+ if (not _metadata_log.inode_exists(_dir_inode)) {
+ return make_exception_future<inode_t>(operation_became_invalid_exception());
+ }
+
+ if (_dir_info->entries.count(_entry_name) != 0) {
+ return make_exception_future<inode_t>(file_already_exists_exception());
+ }
+
+ ondisk_create_inode_as_dir_entry_header ondisk_entry;
+ decltype(ondisk_entry.entry_name_length) entry_name_length;
+ if (_entry_name.size() > std::numeric_limits<decltype(entry_name_length)>::max()) {
+ // TODO: add an assert that the cluster_size is not too small, as that would cause allocating all
+ // clusters and then returning an ENOSPACE error
+ return make_exception_future<inode_t>(filename_too_long_exception());
+ }
+ entry_name_length = _entry_name.size();
+
+ using namespace std::chrono;
+ uint64_t curr_time_ns = duration_cast<nanoseconds>(system_clock::now().time_since_epoch()).count();
+ unix_metadata unx_mtdt = {
+ _perms,
+ 0, // TODO: Eventually, we'll want a user to be able to pass his credentials when bootstrapping the
+ 0, // file system -- that will allow us to authorize users on startup (e.g. via LDAP or whatnot).
+ curr_time_ns,
+ curr_time_ns,
+ curr_time_ns
+ };
+
+ bool creating_dir = [this] {
+ switch (_create_semantics) {
+ case create_semantics::CREATE_FILE:
+ case create_semantics::CREATE_AND_OPEN_FILE:
+ return false;
+ case create_semantics::CREATE_DIR:
+ return true;
+ }
+ __builtin_unreachable();
+ }();
+
+ inode_t new_inode = _metadata_log._inode_allocator.alloc();
+
+ ondisk_entry = {
+ {
+ new_inode,
+ creating_dir,
+ metadata_to_ondisk_metadata(unx_mtdt)
+ },
+ _dir_inode,
+ entry_name_length,
+ };
+
+
+ switch (_metadata_log.append_ondisk_entry(ondisk_entry, _entry_name.data())) {
+ case metadata_log::append_result::TOO_BIG:
+ return make_exception_future<inode_t>(cluster_size_too_small_to_perform_operation_exception());
+ case metadata_log::append_result::NO_SPACE:
+ return make_exception_future<inode_t>(no_more_space_exception());
+ case metadata_log::append_result::APPENDED:
+ inode_info& new_inode_info = _metadata_log.memory_only_create_inode(new_inode,
+ creating_dir, unx_mtdt);
+ _metadata_log.memory_only_add_dir_entry(*_dir_info, new_inode, std::move(_entry_name));
+
+ switch (_create_semantics) {
+ case create_semantics::CREATE_FILE:
+ case create_semantics::CREATE_DIR:
+ break;
+ case create_semantics::CREATE_AND_OPEN_FILE:
+ // We don't have to lock, as there was no context switch since the allocation of the inode number
+ ++new_inode_info.opened_files_count;
+ break;
+ }
+
+ return make_ready_future<inode_t>(new_inode);
+ }
+ __builtin_unreachable();
+ }
+
+public:
+ static future<inode_t> perform(metadata_log& metadata_log, std::string path, file_permissions perms,
+ create_semantics create_semantics) {
+ return do_with(create_file_operation(metadata_log),
+ [path = std::move(path), perms = std::move(perms), create_semantics](auto& obj) {
+ return obj.create_file(std::move(path), std::move(perms), create_semantics);
+ });
+ }
+};
+
+} // namespace seastar::fs
diff --git a/src/fs/metadata_to_disk_buffer.hh b/src/fs/metadata_to_disk_buffer.hh
index 593ad46a..87b2bd8e 100644
--- a/src/fs/metadata_to_disk_buffer.hh
+++ b/src/fs/metadata_to_disk_buffer.hh
@@ -157,6 +157,19 @@ class metadata_to_disk_buffer : protected to_disk_buffer {
return append_simple(CREATE_INODE, create_inode);
}

+ [[nodiscard]] virtual append_result append(const ondisk_create_inode_as_dir_entry_header& create_inode_as_dir_entry,
+ const void* entry_name) noexcept {
+ ondisk_type type = CREATE_INODE_AS_DIR_ENTRY;
+ if (not fits_for_append(ondisk_entry_size(create_inode_as_dir_entry))) {
+ return TOO_BIG;
+ }
+
+ append_bytes(&type, sizeof(type));
+ append_bytes(&create_inode_as_dir_entry, sizeof(create_inode_as_dir_entry));
+ append_bytes(entry_name, create_inode_as_dir_entry.entry_name_length);
+ return APPENDED;
+ }
+
using to_disk_buffer::flush_to_disk;
};

diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc
index be523fc7..d35d3710 100644
--- a/src/fs/metadata_log.cc
+++ b/src/fs/metadata_log.cc
@@ -27,6 +27,7 @@
#include "fs/metadata_log.hh"
#include "fs/metadata_log_bootstrap.hh"
#include "fs/metadata_log_operations/create_and_open_unlinked_file.hh"
+#include "fs/metadata_log_operations/create_file.hh"
#include "fs/metadata_to_disk_buffer.hh"
#include "fs/path.hh"
#include "fs/units.hh"
@@ -97,6 +98,17 @@ inode_info& metadata_log::memory_only_create_inode(inode_t inode, bool is_direct
}).first->second;
}

+void metadata_log::memory_only_add_dir_entry(inode_info::directory& dir, inode_t entry_inode, std::string entry_name) {
+ auto it = _inodes.find(entry_inode);
+ assert(it != _inodes.end());
+ // Directory may only be linked once (to avoid creating cycles)
+ assert(not it->second.is_directory() or not it->second.is_linked());
+
+ bool inserted = dir.entries.emplace(std::move(entry_name), entry_inode).second;
+ assert(inserted);
+ ++it->second.directories_containing_file;
+}
+
void metadata_log::schedule_flush_of_curr_cluster() {
// Make writes concurrent (TODO: maybe serialized within *one* cluster would be faster?)
schedule_background_task(do_with(_curr_cluster_buff, &_device, [](auto& crr_clstr_bf, auto& device) {
@@ -230,10 +242,22 @@ file_offset_t metadata_log::file_size(inode_t inode) const {
}, it->second.contents);
}

+future<> metadata_log::create_file(std::string path, file_permissions perms) {
+ return create_file_operation::perform(*this, std::move(path), std::move(perms), create_semantics::CREATE_FILE).discard_result();
+}
+
+future<inode_t> metadata_log::create_and_open_file(std::string path, file_permissions perms) {
+ return create_file_operation::perform(*this, std::move(path), std::move(perms), create_semantics::CREATE_AND_OPEN_FILE);
+}
+
future<inode_t> metadata_log::create_and_open_unlinked_file(file_permissions perms) {
return create_and_open_unlinked_file_operation::perform(*this, std::move(perms));
}

+future<> metadata_log::create_directory(std::string path, file_permissions perms) {
+ return create_file_operation::perform(*this, std::move(path), std::move(perms), create_semantics::CREATE_DIR).discard_result();
+}
+
// TODO: think about how to make filesystem recoverable from ENOSPACE situation: flush() (or something else) throws ENOSPACE,
// then it should be possible to compact some data (e.g. by truncating a file) via top-level interface and retrying the flush()
// without a ENOSPACE error. In particular if we delete all files after ENOSPACE it should be successful. It becomes especially
diff --git a/src/fs/metadata_log_bootstrap.cc b/src/fs/metadata_log_bootstrap.cc
index 702e0e34..01b567f0 100644
--- a/src/fs/metadata_log_bootstrap.cc
+++ b/src/fs/metadata_log_bootstrap.cc
@@ -213,6 +213,8 @@ future<> metadata_log_bootstrap::bootstrap_checkpointed_data() {
return bootstrap_next_metadata_cluster();
case CREATE_INODE:
return bootstrap_create_inode();
+ case CREATE_INODE_AS_DIR_ENTRY:
+ return bootstrap_create_inode_as_dir_entry();
}

// unknown type => metadata log corruption
@@ -255,6 +257,34 @@ future<> metadata_log_bootstrap::bootstrap_create_inode() {
return now();
}

+future<> metadata_log_bootstrap::bootstrap_create_inode_as_dir_entry() {
+ ondisk_create_inode_as_dir_entry_header entry;
+ if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.dir_inode) or
+ inode_exists(entry.entry_inode.inode)) {
+ return invalid_entry_exception();
+ }
+
+ std::string dir_entry_name;
+ if (not _curr_checkpoint.read_string(dir_entry_name, entry.entry_name_length)) {
+ return invalid_entry_exception();
+ }
+
+ if (not _metadata_log._inodes[entry.dir_inode].is_directory()) {
+ return invalid_entry_exception();
+ }
+ auto& dir = _metadata_log._inodes[entry.dir_inode].get_directory();
+
+ if (dir.entries.count(dir_entry_name) != 0) {
+ return invalid_entry_exception();
+ }
+
+ _metadata_log.memory_only_create_inode(entry.entry_inode.inode, entry.entry_inode.is_directory,
+ ondisk_metadata_to_metadata(entry.entry_inode.metadata));
+ _metadata_log.memory_only_add_dir_entry(dir, entry.entry_inode.inode, std::move(dir_entry_name));
+ // TODO: Maybe mtime_ns for modifying directory?
+ return now();
+}
+
future<> metadata_log_bootstrap::bootstrap(metadata_log& metadata_log, inode_t root_dir, cluster_id_t first_metadata_cluster_id,
cluster_range available_clusters, fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id) {
// Clear the metadata log
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 3304a02b..46cdf803 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -678,6 +678,7 @@ if (Seastar_EXPERIMENTAL_FS)
src/fs/metadata_log_bootstrap.cc
src/fs/metadata_log_bootstrap.hh
src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
+ src/fs/metadata_log_operations/create_file.hh

Krzysztof Małysa
Apr 20, 2020, 8:02:34 AM
to seast...@googlegroups.com, Krzysztof Małysa, sa...@scylladb.com, ank...@gmail.com, qup...@gmail.com, wmi...@protonmail.com
Some operations need to schedule deleting an inode in the background. One
of these is closing an unlinked file when nobody else holds it open.
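The deferred-delete rule this patch implements can be sketched as follows. This is an illustrative model only (the names `inode_state`, `may_delete`, and the plain `std::map` are assumptions, not the actual SeastarFS types): when the background task finally runs, the delete is still valid only if the inode still exists, is unlinked, and is not open.

```cpp
#include <cassert>
#include <map>

// Simplified stand-in for the per-inode bookkeeping in metadata_log
// (directories_containing_file and opened_files_count in the patch).
struct inode_state {
    unsigned links_count = 0;
    unsigned opened_files_count = 0;
};

// Returns true iff a scheduled delete is still valid when the
// background task runs: the inode must still exist, be unlinked,
// and have no open file handles.
inline bool may_delete(const std::map<unsigned, inode_state>& inodes, unsigned inode) {
    auto it = inodes.find(inode);
    if (it == inodes.end()) {
        return false; // already deleted by an earlier task
    }
    const inode_state& st = it->second;
    return st.links_count == 0 && st.opened_files_count == 0;
}
```

This mirrors the check at the top of `schedule_attempt_to_delete_inode()` in the patch, where an invalidated delete simply resolves to `now()`.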

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
src/fs/metadata_disk_entries.hh | 8 ++++++-
src/fs/metadata_log.hh | 3 +++
src/fs/metadata_log_bootstrap.hh | 2 ++
src/fs/metadata_to_disk_buffer.hh | 4 ++++
src/fs/metadata_log.cc | 38 +++++++++++++++++++++++++++++++
src/fs/metadata_log_bootstrap.cc | 21 +++++++++++++++++
6 files changed, 75 insertions(+), 1 deletion(-)

diff --git a/src/fs/metadata_disk_entries.hh b/src/fs/metadata_disk_entries.hh
index 9c44b8cc..310b1864 100644
--- a/src/fs/metadata_disk_entries.hh
+++ b/src/fs/metadata_disk_entries.hh
@@ -73,6 +73,7 @@ enum ondisk_type : uint8_t {
CHECKPOINT,
NEXT_METADATA_CLUSTER,
CREATE_INODE,
+ DELETE_INODE,
CREATE_INODE_AS_DIR_ENTRY,
};

@@ -103,6 +104,10 @@ struct ondisk_create_inode {
ondisk_unix_metadata metadata;
} __attribute__((packed));

+struct ondisk_delete_inode {
+ inode_t inode;
+} __attribute__((packed));
+
struct ondisk_create_inode_as_dir_entry_header {
ondisk_create_inode entry_inode;
inode_t dir_inode;
@@ -113,7 +118,8 @@ struct ondisk_create_inode_as_dir_entry_header {
template<typename T>
constexpr size_t ondisk_entry_size(const T& entry) noexcept {
static_assert(std::is_same_v<T, ondisk_next_metadata_cluster> or
- std::is_same_v<T, ondisk_create_inode>, "ondisk entry size not defined for given type");
+ std::is_same_v<T, ondisk_create_inode> or
+ std::is_same_v<T, ondisk_delete_inode>, "ondisk entry size not defined for given type");
return sizeof(ondisk_type) + sizeof(entry);
}
constexpr size_t ondisk_entry_size(const ondisk_create_inode_as_dir_entry_header& entry) noexcept {
diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
index cc11b865..be5e843b 100644
--- a/src/fs/metadata_log.hh
+++ b/src/fs/metadata_log.hh
@@ -180,6 +180,7 @@ class metadata_log {
}

inode_info& memory_only_create_inode(inode_t inode, bool is_directory, unix_metadata metadata);
+ void memory_only_delete_inode(inode_t inode);
void memory_only_add_dir_entry(inode_info::directory& dir, inode_t entry_inode, std::string entry_name);

template<class Func>
@@ -232,6 +233,8 @@ class metadata_log {
__builtin_unreachable();
}

+ void schedule_attempt_to_delete_inode(inode_t inode);
+
enum class path_lookup_error {
NOT_ABSOLUTE, // a path is not absolute
NO_ENTRY, // no such file or directory
diff --git a/src/fs/metadata_log_bootstrap.hh b/src/fs/metadata_log_bootstrap.hh
index d44c2f96..b28bce7f 100644
--- a/src/fs/metadata_log_bootstrap.hh
+++ b/src/fs/metadata_log_bootstrap.hh
@@ -117,6 +117,8 @@ class metadata_log_bootstrap {

future<> bootstrap_create_inode();

+ future<> bootstrap_delete_inode();
+
future<> bootstrap_create_inode_as_dir_entry();

public:
diff --git a/src/fs/metadata_to_disk_buffer.hh b/src/fs/metadata_to_disk_buffer.hh
index 87b2bd8e..9eb1c538 100644
--- a/src/fs/metadata_to_disk_buffer.hh
+++ b/src/fs/metadata_to_disk_buffer.hh
@@ -157,6 +157,10 @@ class metadata_to_disk_buffer : protected to_disk_buffer {
return append_simple(CREATE_INODE, create_inode);
}

+ [[nodiscard]] virtual append_result append(const ondisk_delete_inode& delete_inode) noexcept {
+ return append_simple(DELETE_INODE, delete_inode);
+ }
+
[[nodiscard]] virtual append_result append(const ondisk_create_inode_as_dir_entry_header& create_inode_as_dir_entry,
const void* entry_name) noexcept {
ondisk_type type = CREATE_INODE_AS_DIR_ENTRY;
diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc
index d35d3710..7f42f353 100644
--- a/src/fs/metadata_log.cc
+++ b/src/fs/metadata_log.cc
@@ -98,6 +98,24 @@ inode_info& metadata_log::memory_only_create_inode(inode_t inode, bool is_direct
}).first->second;
}

+void metadata_log::memory_only_delete_inode(inode_t inode) {
+ auto it = _inodes.find(inode);
+ assert(it != _inodes.end());
+ assert(not it->second.is_open());
+ assert(not it->second.is_linked());
+
+ std::visit(overloaded {
+ [](const inode_info::directory& dir) {
+ assert(dir.entries.empty());
+ },
+ [](const inode_info::file&) {
+ // TODO: for compaction: update used inode_data_vec
+ }
+ }, it->second.contents);
+
+ _inodes.erase(it);
+}
+
void metadata_log::memory_only_add_dir_entry(inode_info::directory& dir, inode_t entry_inode, std::string entry_name) {
auto it = _inodes.find(entry_inode);
assert(it != _inodes.end());
@@ -150,6 +168,26 @@ metadata_log::flush_result metadata_log::schedule_flush_of_curr_cluster_and_chan
return flush_result::DONE;
}

+void metadata_log::schedule_attempt_to_delete_inode(inode_t inode) {
+ return schedule_background_task([this, inode] {
+ auto it = _inodes.find(inode);
+ if (it == _inodes.end() or it->second.is_linked() or it->second.is_open()) {
+ return now(); // Scheduled delete became invalid
+ }
+
+ switch (append_ondisk_entry(ondisk_delete_inode {inode})) {
+ case append_result::TOO_BIG:
+ assert(false and "ondisk entry cannot be too big");
+ case append_result::NO_SPACE:
+ return make_exception_future(no_more_space_exception());
+ case append_result::APPENDED:
+ memory_only_delete_inode(inode);
+ return now();
+ }
+ __builtin_unreachable();
+ });
+}
+
std::variant<inode_t, metadata_log::path_lookup_error> metadata_log::do_path_lookup(const std::string& path) const noexcept {
if (path.empty() or path[0] != '/') {
return path_lookup_error::NOT_ABSOLUTE;
diff --git a/src/fs/metadata_log_bootstrap.cc b/src/fs/metadata_log_bootstrap.cc
index 01b567f0..3058328a 100644
--- a/src/fs/metadata_log_bootstrap.cc
+++ b/src/fs/metadata_log_bootstrap.cc
@@ -213,6 +213,8 @@ future<> metadata_log_bootstrap::bootstrap_checkpointed_data() {
return bootstrap_next_metadata_cluster();
case CREATE_INODE:
return bootstrap_create_inode();
+ case DELETE_INODE:
+ return bootstrap_delete_inode();
case CREATE_INODE_AS_DIR_ENTRY:
return bootstrap_create_inode_as_dir_entry();
}
@@ -257,6 +259,25 @@ future<> metadata_log_bootstrap::bootstrap_create_inode() {
return now();
}

+future<> metadata_log_bootstrap::bootstrap_delete_inode() {
+ ondisk_delete_inode entry;
+ if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.inode)) {
+ return invalid_entry_exception();
+ }
+
+ inode_info& inode_info = _metadata_log._inodes.at(entry.inode);
+ if (inode_info.directories_containing_file > 0) {
+ return invalid_entry_exception(); // Only unlinked inodes may be deleted
+ }
+
+ if (inode_info.is_directory() and not inode_info.get_directory().entries.empty()) {
+ return invalid_entry_exception(); // Only empty directories may be deleted
+ }
+
+ _metadata_log.memory_only_delete_inode(entry.inode);
+ return now();
+}
+
future<> metadata_log_bootstrap::bootstrap_create_inode_as_dir_entry() {
ondisk_create_inode_as_dir_entry_header entry;
if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.dir_inode) or
--
2.26.1

Krzysztof Małysa
Apr 20, 2020, 8:02:37 AM
to seast...@googlegroups.com, Krzysztof Małysa, sa...@scylladb.com, ank...@gmail.com, qup...@gmail.com, wmi...@protonmail.com
Allows the same file to be visible via different paths, and allows giving
a path to an unlinked file.
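The variable-length ADD_DIR_ENTRY record introduced below is laid out as a one-byte type tag, a packed header, then `entry_name_length` bytes of the name with no terminating NUL; `ondisk_entry_size()` accounts for all three parts. A minimal serialization sketch (the numeric tag value and the `serialize_add_dir_entry` helper are hypothetical, not part of the patch):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

using inode_t = uint64_t;

// Same packed layout as the patch's ondisk_add_dir_entry_header:
// the entry name follows the header on disk.
struct __attribute__((packed)) ondisk_add_dir_entry_header {
    inode_t dir_inode;
    inode_t entry_inode;
    uint16_t entry_name_length;
};

inline std::vector<char> serialize_add_dir_entry(inode_t dir, inode_t entry,
                                                 const std::string& name) {
    ondisk_add_dir_entry_header hdr{dir, entry, static_cast<uint16_t>(name.size())};
    std::vector<char> buf;
    uint8_t type = 4; // hypothetical numeric value of ADD_DIR_ENTRY
    buf.push_back(static_cast<char>(type));
    const char* p = reinterpret_cast<const char*>(&hdr);
    buf.insert(buf.end(), p, p + sizeof(hdr));        // packed header
    buf.insert(buf.end(), name.begin(), name.end());  // then the raw name bytes
    return buf;
}
```

The total size matches `ondisk_entry_size()` in the patch: `sizeof(ondisk_type) + sizeof(entry) + entry.entry_name_length`.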

Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
src/fs/metadata_disk_entries.hh | 11 ++
src/fs/metadata_log.hh | 7 ++
src/fs/metadata_log_bootstrap.hh | 2 +
src/fs/metadata_log_operations/link_file.hh | 112 ++++++++++++++++++++
src/fs/metadata_to_disk_buffer.hh | 12 +++
src/fs/metadata_log.cc | 11 ++
src/fs/metadata_log_bootstrap.cc | 34 ++++++
CMakeLists.txt | 1 +
8 files changed, 190 insertions(+)
create mode 100644 src/fs/metadata_log_operations/link_file.hh

diff --git a/src/fs/metadata_disk_entries.hh b/src/fs/metadata_disk_entries.hh
index 310b1864..b81c25f5 100644
--- a/src/fs/metadata_disk_entries.hh
+++ b/src/fs/metadata_disk_entries.hh
@@ -74,6 +74,7 @@ enum ondisk_type : uint8_t {
NEXT_METADATA_CLUSTER,
CREATE_INODE,
DELETE_INODE,
+ ADD_DIR_ENTRY,
CREATE_INODE_AS_DIR_ENTRY,
};

@@ -108,6 +109,13 @@ struct ondisk_delete_inode {
inode_t inode;
} __attribute__((packed));

+struct ondisk_add_dir_entry_header {
+ inode_t dir_inode;
+ inode_t entry_inode;
+ uint16_t entry_name_length;
+ // After header comes entry name
+} __attribute__((packed));
+
struct ondisk_create_inode_as_dir_entry_header {
ondisk_create_inode entry_inode;
inode_t dir_inode;
@@ -122,6 +130,9 @@ constexpr size_t ondisk_entry_size(const T& entry) noexcept {
std::is_same_v<T, ondisk_delete_inode>, "ondisk entry size not defined for given type");
return sizeof(ondisk_type) + sizeof(entry);
}
+constexpr size_t ondisk_entry_size(const ondisk_add_dir_entry_header& entry) noexcept {
+ return sizeof(ondisk_type) + sizeof(entry) + entry.entry_name_length;
+}
constexpr size_t ondisk_entry_size(const ondisk_create_inode_as_dir_entry_header& entry) noexcept {
return sizeof(ondisk_type) + sizeof(entry) + entry.entry_name_length;
}
diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
index be5e843b..f5373458 100644
--- a/src/fs/metadata_log.hh
+++ b/src/fs/metadata_log.hh
@@ -158,6 +158,7 @@ class metadata_log {

friend class create_and_open_unlinked_file_operation;
friend class create_file_operation;
+ friend class link_file_operation;

public:
metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment,
@@ -303,6 +304,12 @@ class metadata_log {

future<> create_directory(std::string path, file_permissions perms);

+ // Creates name (@p path) for a file (@p inode)
+ future<> link_file(inode_t inode, std::string path);
+
+ // Creates name (@p destination) for a file (not directory) @p source
+ future<> link_file(std::string source, std::string destination);
+
// All disk-related errors will be exposed here
future<> flush_log() {
return flush_curr_cluster();
diff --git a/src/fs/metadata_log_bootstrap.hh b/src/fs/metadata_log_bootstrap.hh
index b28bce7f..3b38b232 100644
--- a/src/fs/metadata_log_bootstrap.hh
+++ b/src/fs/metadata_log_bootstrap.hh
@@ -119,6 +119,8 @@ class metadata_log_bootstrap {

future<> bootstrap_delete_inode();

+ future<> bootstrap_add_dir_entry();
+
future<> bootstrap_create_inode_as_dir_entry();

public:
diff --git a/src/fs/metadata_log_operations/link_file.hh b/src/fs/metadata_log_operations/link_file.hh
new file mode 100644
index 00000000..207fe327
--- /dev/null
+++ b/src/fs/metadata_log_operations/link_file.hh
@@ -0,0 +1,112 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/metadata_disk_entries.hh"
+#include "fs/metadata_log.hh"
+#include "fs/path.hh"
+
+namespace seastar::fs {
+
+class link_file_operation {
+ metadata_log& _metadata_log;
+ inode_t _src_inode;
+ std::string _entry_name;
+ inode_t _dir_inode;
+ inode_info::directory* _dir_info;
+
+ link_file_operation(metadata_log& metadata_log) : _metadata_log(metadata_log) {}
+
+ future<> link_file(inode_t inode, std::string path) {
+ _src_inode = inode;
+ _entry_name = extract_last_component(path);
+ if (_entry_name.empty()) {
+ return make_exception_future(is_directory_exception());
+ }
+ assert(path.empty() or path.back() == '/'); // Hence fast-checking for "is directory" is done in path_lookup
+
+ return _metadata_log.path_lookup(path).then([this](inode_t dir_inode) {
+ _dir_inode = dir_inode;
+ // Fail-fast checks before locking (as locking may be expensive)
+ auto dir_it = _metadata_log._inodes.find(_dir_inode);
+ if (dir_it == _metadata_log._inodes.end()) {
+ return make_exception_future(operation_became_invalid_exception());
+ }
+ assert(dir_it->second.is_directory() and "Directory cannot become file or there is a BUG in path_lookup");
+ _dir_info = &dir_it->second.get_directory();
+
+ if (_dir_info->entries.count(_entry_name) != 0) {
+ return make_exception_future(file_already_exists_exception());
+ }
+
+ return _metadata_log._locks.with_locks(metadata_log::locks::shared {dir_inode},
+ metadata_log::locks::unique {dir_inode, _entry_name}, [this] {
+ return link_file_in_directory();
+ });
+ });
+ }
+
+ future<> link_file_in_directory() {
+ if (not _metadata_log.inode_exists(_dir_inode)) {
+ return make_exception_future(operation_became_invalid_exception());
+ }
+
+ if (_dir_info->entries.count(_entry_name) != 0) {
+ return make_exception_future(file_already_exists_exception());
+ }
+
+ ondisk_add_dir_entry_header ondisk_entry;
+ decltype(ondisk_entry.entry_name_length) entry_name_length;
+ if (_entry_name.size() > std::numeric_limits<decltype(entry_name_length)>::max()) {
+ // TODO: add an assert that the cluster_size is not too small, as that would cause allocating all
+ // clusters and then returning an ENOSPACE error
+ return make_exception_future(filename_too_long_exception());
+ }
+ entry_name_length = _entry_name.size();
+
+ ondisk_entry = {
+ _dir_inode,
+ _src_inode,
+ entry_name_length,
+ };
+
+ switch (_metadata_log.append_ondisk_entry(ondisk_entry, _entry_name.data())) {
+ case metadata_log::append_result::TOO_BIG:
+ return make_exception_future(cluster_size_too_small_to_perform_operation_exception());
+ case metadata_log::append_result::NO_SPACE:
+ return make_exception_future(no_more_space_exception());
+ case metadata_log::append_result::APPENDED:
+ _metadata_log.memory_only_add_dir_entry(*_dir_info, _src_inode, std::move(_entry_name));
+ return now();
+ }
+ __builtin_unreachable();
+ }
+
+public:
+ static future<> perform(metadata_log& metadata_log, inode_t inode, std::string path) {
+ return do_with(link_file_operation(metadata_log), [inode, path = std::move(path)](auto& obj) {
+ return obj.link_file(inode, std::move(path));
+ });
+ }
+};
+
+} // namespace seastar::fs
diff --git a/src/fs/metadata_to_disk_buffer.hh b/src/fs/metadata_to_disk_buffer.hh
index 9eb1c538..38180224 100644
--- a/src/fs/metadata_to_disk_buffer.hh
+++ b/src/fs/metadata_to_disk_buffer.hh
@@ -161,6 +161,18 @@ class metadata_to_disk_buffer : protected to_disk_buffer {
return append_simple(DELETE_INODE, delete_inode);
}

+ [[nodiscard]] virtual append_result append(const ondisk_add_dir_entry_header& add_dir_entry, const void* entry_name) noexcept {
+ ondisk_type type = ADD_DIR_ENTRY;
+ if (not fits_for_append(ondisk_entry_size(add_dir_entry))) {
+ return TOO_BIG;
+ }
+
+ append_bytes(&type, sizeof(type));
+ append_bytes(&add_dir_entry, sizeof(add_dir_entry));
+ append_bytes(entry_name, add_dir_entry.entry_name_length);
+ return APPENDED;
+ }
+
[[nodiscard]] virtual append_result append(const ondisk_create_inode_as_dir_entry_header& create_inode_as_dir_entry,
const void* entry_name) noexcept {
ondisk_type type = CREATE_INODE_AS_DIR_ENTRY;
diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc
index 7f42f353..a8b17c2b 100644
--- a/src/fs/metadata_log.cc
+++ b/src/fs/metadata_log.cc
@@ -28,6 +28,7 @@
#include "fs/metadata_log_bootstrap.hh"
#include "fs/metadata_log_operations/create_and_open_unlinked_file.hh"
#include "fs/metadata_log_operations/create_file.hh"
+#include "fs/metadata_log_operations/link_file.hh"
#include "fs/metadata_to_disk_buffer.hh"
#include "fs/path.hh"
#include "fs/units.hh"
@@ -296,6 +297,16 @@ future<> metadata_log::create_directory(std::string path, file_permissions perms
return create_file_operation::perform(*this, std::move(path), std::move(perms), create_semantics::CREATE_DIR).discard_result();
}

+future<> metadata_log::link_file(inode_t inode, std::string path) {
+ return link_file_operation::perform(*this, inode, std::move(path));
+}
+
+future<> metadata_log::link_file(std::string source, std::string destination) {
+ return path_lookup(std::move(source)).then([this, destination = std::move(destination)](inode_t inode) {
+ return link_file(inode, std::move(destination));
+ });
+}
+
// TODO: think about how to make filesystem recoverable from ENOSPACE situation: flush() (or something else) throws ENOSPACE,
// then it should be possible to compact some data (e.g. by truncating a file) via top-level interface and retrying the flush()
// without a ENOSPACE error. In particular if we delete all files after ENOSPACE it should be successful. It becomes especially
diff --git a/src/fs/metadata_log_bootstrap.cc b/src/fs/metadata_log_bootstrap.cc
index 3058328a..64396d11 100644
--- a/src/fs/metadata_log_bootstrap.cc
+++ b/src/fs/metadata_log_bootstrap.cc
@@ -215,6 +215,8 @@ future<> metadata_log_bootstrap::bootstrap_checkpointed_data() {
return bootstrap_create_inode();
case DELETE_INODE:
return bootstrap_delete_inode();
+ case ADD_DIR_ENTRY:
+ return bootstrap_add_dir_entry();
case CREATE_INODE_AS_DIR_ENTRY:
return bootstrap_create_inode_as_dir_entry();
}
@@ -278,6 +280,38 @@ future<> metadata_log_bootstrap::bootstrap_delete_inode() {
return now();
}

+future<> metadata_log_bootstrap::bootstrap_add_dir_entry() {
+ ondisk_add_dir_entry_header entry;
+ if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.dir_inode) or
+ not inode_exists(entry.entry_inode)) {
+ return invalid_entry_exception();
+ }
+
+ std::string dir_entry_name;
+ if (not _curr_checkpoint.read_string(dir_entry_name, entry.entry_name_length)) {
+ return invalid_entry_exception();
+ }
+
+ // Only files may be linked, so as not to create cycles (directories are created and linked
+ // using CREATE_INODE_AS_DIR_ENTRY)
+ if (not _metadata_log._inodes[entry.entry_inode].is_file()) {
+ return invalid_entry_exception();
+ }
+
+ if (not _metadata_log._inodes[entry.dir_inode].is_directory()) {
+ return invalid_entry_exception();
+ }
+ auto& dir = _metadata_log._inodes[entry.dir_inode].get_directory();
+
+ if (dir.entries.count(dir_entry_name) != 0) {
+ return invalid_entry_exception();
+ }
+
+ _metadata_log.memory_only_add_dir_entry(dir, entry.entry_inode, std::move(dir_entry_name));
+ // TODO: Maybe mtime_ns for modifying directory?
+ return now();
+}
+
future<> metadata_log_bootstrap::bootstrap_create_inode_as_dir_entry() {
ondisk_create_inode_as_dir_entry_header entry;
if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.dir_inode) or
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 46cdf803..6259742e 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -679,6 +679,7 @@ if (Seastar_EXPERIMENTAL_FS)
src/fs/metadata_log_bootstrap.hh
src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
src/fs/metadata_log_operations/create_file.hh
+ src/fs/metadata_log_operations/link_file.hh

Krzysztof Małysa

Apr 20, 2020, 8:02:37 AM4/20/20
to seast...@googlegroups.com, Michał Niciejewski, sa...@scylladb.com, ank...@gmail.com, wmi...@protonmail.com
From: Michał Niciejewski <qup...@gmail.com>

Marks the file as opened by increasing its opened-file counter.

Signed-off-by: Michał Niciejewski <qup...@gmail.com>
---
src/fs/metadata_log.hh | 3 +++
src/fs/metadata_log.cc | 22 ++++++++++++++++++++++
2 files changed, 25 insertions(+)

diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
index 02af052e..7bb2c6bc 100644
--- a/src/fs/metadata_log.hh
+++ b/src/fs/metadata_log.hh
@@ -319,6 +319,9 @@ class metadata_log {
// Removes empty directory or unlinks file
future<> remove(std::string path);

+ // TODO: what about permissions, uid, gid etc.
+ future<inode_t> open_file(std::string path);
+
// All disk-related errors will be exposed here
future<> flush_log() {
return flush_curr_cluster();
diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc
index 752682e4..00ce88d2 100644
--- a/src/fs/metadata_log.cc
+++ b/src/fs/metadata_log.cc
@@ -332,6 +332,28 @@ future<> metadata_log::remove(std::string path) {
return unlink_or_remove_file_operation::perform(*this, std::move(path), remove_semantics::FILE_OR_DIR);
}

+future<inode_t> metadata_log::open_file(std::string path) {
+ return path_lookup(path).then([this](inode_t inode) {
+ auto inode_it = _inodes.find(inode);
+ if (inode_it == _inodes.end()) {
+ return make_exception_future<inode_t>(operation_became_invalid_exception());
+ }
+ inode_info* inode_info = &inode_it->second;
+ if (inode_info->is_directory()) {
+ return make_exception_future<inode_t>(is_directory_exception());
+ }
+
+ // TODO: can be replaced by something like _inode_info.during_delete
+ return _locks.with_lock(metadata_log::locks::shared {inode}, [this, inode_info = std::move(inode_info), inode] {
+ if (not inode_exists(inode)) {
+ return make_exception_future<inode_t>(operation_became_invalid_exception());
+ }
+ ++inode_info->opened_files_count;
+ return make_ready_future<inode_t>(inode);
+ });
+ });
+}
+
// TODO: think about how to make filesystem recoverable from ENOSPACE situation: flush() (or something else) throws ENOSPACE,
// then it should be possible to compact some data (e.g. by truncating a file) via top-level interface and retrying the flush()
// without a ENOSPACE error. In particular if we delete all files after ENOSPACE it should be successful. It becomes especially
--
2.26.1

Krzysztof Małysa

Apr 20, 2020, 8:02:37 AM4/20/20
to seast...@googlegroups.com, Krzysztof Małysa, sa...@scylladb.com, ank...@gmail.com, qup...@gmail.com, wmi...@protonmail.com
Signed-off-by: Krzysztof Małysa <var...@gmail.com>
---
src/fs/metadata_disk_entries.hh | 21 ++
src/fs/metadata_log.hh | 9 +
src/fs/metadata_log_bootstrap.hh | 4 +
.../unlink_or_remove_file.hh | 196 ++++++++++++++++++
src/fs/metadata_to_disk_buffer.hh | 24 +++
src/fs/metadata_log.cc | 25 +++
src/fs/metadata_log_bootstrap.cc | 70 +++++++
CMakeLists.txt | 1 +
8 files changed, 350 insertions(+)
create mode 100644 src/fs/metadata_log_operations/unlink_or_remove_file.hh

diff --git a/src/fs/metadata_disk_entries.hh b/src/fs/metadata_disk_entries.hh
index b81c25f5..2f363a9b 100644
--- a/src/fs/metadata_disk_entries.hh
+++ b/src/fs/metadata_disk_entries.hh
@@ -76,6 +76,8 @@ enum ondisk_type : uint8_t {
DELETE_INODE,
ADD_DIR_ENTRY,
CREATE_INODE_AS_DIR_ENTRY,
+ DELETE_DIR_ENTRY,
+ DELETE_INODE_AND_DIR_ENTRY,
};

struct ondisk_checkpoint {
@@ -123,6 +125,19 @@ struct ondisk_create_inode_as_dir_entry_header {
// After header comes entry name
} __attribute__((packed));

+struct ondisk_delete_dir_entry_header {
+ inode_t dir_inode;
+ uint16_t entry_name_length;
+ // After header comes entry name
+} __attribute__((packed));
+
+struct ondisk_delete_inode_and_dir_entry_header {
+ inode_t inode_to_delete;
+ inode_t dir_inode;
+ uint16_t entry_name_length;
+ // After header comes entry name
+} __attribute__((packed));
+
template<typename T>
constexpr size_t ondisk_entry_size(const T& entry) noexcept {
static_assert(std::is_same_v<T, ondisk_next_metadata_cluster> or
@@ -136,5 +151,11 @@ constexpr size_t ondisk_entry_size(const ondisk_add_dir_entry_header& entry) noe
constexpr size_t ondisk_entry_size(const ondisk_create_inode_as_dir_entry_header& entry) noexcept {
return sizeof(ondisk_type) + sizeof(entry) + entry.entry_name_length;
}
+constexpr size_t ondisk_entry_size(const ondisk_delete_dir_entry_header& entry) noexcept {
+ return sizeof(ondisk_type) + sizeof(entry) + entry.entry_name_length;
+}
+constexpr size_t ondisk_entry_size(const ondisk_delete_inode_and_dir_entry_header& entry) noexcept {
+ return sizeof(ondisk_type) + sizeof(entry) + entry.entry_name_length;
+}

} // namespace seastar::fs
diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
index f5373458..02af052e 100644
--- a/src/fs/metadata_log.hh
+++ b/src/fs/metadata_log.hh
@@ -159,6 +159,7 @@ class metadata_log {
friend class create_and_open_unlinked_file_operation;
friend class create_file_operation;
friend class link_file_operation;
+ friend class unlink_or_remove_file_operation;

public:
metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment,
@@ -183,6 +184,7 @@ class metadata_log {
inode_info& memory_only_create_inode(inode_t inode, bool is_directory, unix_metadata metadata);
void memory_only_delete_inode(inode_t inode);
void memory_only_add_dir_entry(inode_info::directory& dir, inode_t entry_inode, std::string entry_name);
+ void memory_only_delete_dir_entry(inode_info::directory& dir, std::string entry_name);

template<class Func>
void schedule_background_task(Func&& task) {
@@ -310,6 +312,13 @@ class metadata_log {
// Creates name (@p destination) for a file (not directory) @p source
future<> link_file(std::string source, std::string destination);

+ future<> unlink_file(std::string path);
+
+ future<> remove_directory(std::string path);
+
+ // Removes empty directory or unlinks file
+ future<> remove(std::string path);
+
// All disk-related errors will be exposed here
future<> flush_log() {
return flush_curr_cluster();
diff --git a/src/fs/metadata_log_bootstrap.hh b/src/fs/metadata_log_bootstrap.hh
index 3b38b232..16b429ab 100644
--- a/src/fs/metadata_log_bootstrap.hh
+++ b/src/fs/metadata_log_bootstrap.hh
@@ -123,6 +123,10 @@ class metadata_log_bootstrap {

future<> bootstrap_create_inode_as_dir_entry();

+ future<> bootstrap_delete_dir_entry();
+
+ future<> bootstrap_delete_inode_and_dir_entry();
+
public:
static future<> bootstrap(metadata_log& metadata_log, inode_t root_dir, cluster_id_t first_metadata_cluster_id,
cluster_range available_clusters, fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id);
diff --git a/src/fs/metadata_log_operations/unlink_or_remove_file.hh b/src/fs/metadata_log_operations/unlink_or_remove_file.hh
new file mode 100644
index 00000000..a5f29cd9
--- /dev/null
+++ b/src/fs/metadata_log_operations/unlink_or_remove_file.hh
@@ -0,0 +1,196 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/inode.hh"
+#include "fs/inode_info.hh"
+#include "fs/metadata_disk_entries.hh"
+#include "fs/metadata_log.hh"
+#include "fs/path.hh"
+#include "seastar/core/future-util.hh"
+#include "seastar/core/future.hh"
+
+namespace seastar::fs {
+
+enum class remove_semantics {
+ FILE_ONLY,
+ DIR_ONLY,
+ FILE_OR_DIR,
+};
+
+class unlink_or_remove_file_operation {
+ metadata_log& _metadata_log;
+ remove_semantics _remove_semantics;
+ std::string _entry_name;
+ inode_t _dir_inode;
+ inode_info::directory* _dir_info;
+ inode_t _entry_inode;
+ inode_info* _entry_inode_info;
+
+ unlink_or_remove_file_operation(metadata_log& metadata_log) : _metadata_log(metadata_log) {}
+
+ future<> unlink_or_remove(std::string path, remove_semantics remove_semantics) {
+ _remove_semantics = remove_semantics;
+ while (not path.empty() and path.back() == '/') {
+ path.pop_back();
+ }
+
+ _entry_name = extract_last_component(path);
+ if (_entry_name.empty()) {
+ return make_exception_future(invalid_path_exception()); // We cannot remove "/"
+ }
+ assert(path.empty() or path.back() == '/'); // Hence fast-check for "is directory" is done in path_lookup
+
+ return _metadata_log.path_lookup(path).then([this](inode_t dir_inode) {
+ _dir_inode = dir_inode;
+ // Fail-fast checks before locking (as locking may be expensive)
+ auto dir_it = _metadata_log._inodes.find(dir_inode);
+ if (dir_it == _metadata_log._inodes.end()) {
+ return make_exception_future(operation_became_invalid_exception());
+ }
+ assert(dir_it->second.is_directory() and "Directory cannot become file or there is a BUG in path_lookup");
+ _dir_info = &dir_it->second.get_directory();
+
+ auto entry_it = _dir_info->entries.find(_entry_name);
+ if (entry_it == _dir_info->entries.end()) {
+ return make_exception_future(no_such_file_or_directory_exception());
+ }
+ _entry_inode = entry_it->second;
+
+ _entry_inode_info = &_metadata_log._inodes.at(_entry_inode);
+ if (_entry_inode_info->is_directory()) {
+ switch (_remove_semantics) {
+ case remove_semantics::FILE_ONLY:
+ return make_exception_future(is_directory_exception());
+ case remove_semantics::DIR_ONLY:
+ case remove_semantics::FILE_OR_DIR:
+ break;
+ }
+
+ if (not _entry_inode_info->get_directory().entries.empty()) {
+ return make_exception_future(directory_not_empty_exception());
+ }
+ } else {
+ assert(_entry_inode_info->is_file());
+ switch (_remove_semantics) {
+ case remove_semantics::DIR_ONLY:
+ return make_exception_future(is_directory_exception());
+ case remove_semantics::FILE_ONLY:
+ case remove_semantics::FILE_OR_DIR:
+ break;
+ }
+ }
+
+ // Getting a lock on directory entry is enough to ensure it won't disappear because deleting directory
+ // requires it to be empty
+ if (_entry_inode_info->is_directory()) {
+ return _metadata_log._locks.with_locks(metadata_log::locks::unique {dir_inode, _entry_name}, metadata_log::locks::unique {_entry_inode}, [this] {
+ return unlink_or_remove_file_in_directory();
+ });
+ } else {
+ return _metadata_log._locks.with_locks(metadata_log::locks::unique {dir_inode, _entry_name}, metadata_log::locks::shared {_entry_inode}, [this] {
+ return unlink_or_remove_file_in_directory();
+ });
+ }
+ });
+ }
+
+ future<> unlink_or_remove_file_in_directory() {
+ if (not _metadata_log.inode_exists(_dir_inode)) {
+ return make_exception_future(operation_became_invalid_exception());
+ }
+
+ auto entry_it = _dir_info->entries.find(_entry_name);
+ if (entry_it == _dir_info->entries.end() or entry_it->second != _entry_inode) {
+ return make_exception_future(operation_became_invalid_exception());
+ }
+
+ if (_entry_inode_info->is_directory()) {
+ inode_info::directory& dir = _entry_inode_info->get_directory();
+ if (not dir.entries.empty()) {
+ return make_exception_future(directory_not_empty_exception());
+ }
+
+ assert(_entry_inode_info->directories_containing_file == 1);
+
+ // Ready to delete directory
+ ondisk_delete_inode_and_dir_entry_header ondisk_entry;
+ using entry_name_length_t = decltype(ondisk_entry.entry_name_length);
+ assert(_entry_name.size() <= std::numeric_limits<entry_name_length_t>::max());
+
+ ondisk_entry = {
+ _entry_inode,
+ _dir_inode,
+ static_cast<entry_name_length_t>(_entry_name.size())
+ };
+
+ switch (_metadata_log.append_ondisk_entry(ondisk_entry, _entry_name.data())) {
+ case metadata_log::append_result::TOO_BIG:
+ return make_exception_future(cluster_size_too_small_to_perform_operation_exception());
+ case metadata_log::append_result::NO_SPACE:
+ return make_exception_future(no_more_space_exception());
+ case metadata_log::append_result::APPENDED:
+ _metadata_log.memory_only_delete_dir_entry(*_dir_info, _entry_name);
+ _metadata_log.memory_only_delete_inode(_entry_inode);
+ return now();
+ }
+ __builtin_unreachable();
+ }
+
+ assert(_entry_inode_info->is_file());
+
+ // Ready to unlink file
+ ondisk_delete_dir_entry_header ondisk_entry;
+ using entry_name_length_t = decltype(ondisk_entry.entry_name_length);
+ assert(_entry_name.size() <= std::numeric_limits<entry_name_length_t>::max());
+ ondisk_entry = {
+ _dir_inode,
+ static_cast<entry_name_length_t>(_entry_name.size())
+ };
+
+ switch (_metadata_log.append_ondisk_entry(ondisk_entry, _entry_name.data())) {
+ case metadata_log::append_result::TOO_BIG:
+ return make_exception_future(cluster_size_too_small_to_perform_operation_exception());
+ case metadata_log::append_result::NO_SPACE:
+ return make_exception_future(no_more_space_exception());
+ case metadata_log::append_result::APPENDED:
+ _metadata_log.memory_only_delete_dir_entry(*_dir_info, _entry_name);
+ break;
+ }
+
+ if (not _entry_inode_info->is_linked() and not _entry_inode_info->is_open()) {
+ // File became unlinked and not open, so we need to delete it
+ _metadata_log.schedule_attempt_to_delete_inode(_entry_inode);
+ }
+
+ return now();
+ }
+
+public:
+ static future<> perform(metadata_log& metadata_log, std::string path, remove_semantics remove_semantics) {
+ return do_with(unlink_or_remove_file_operation(metadata_log), [path = std::move(path), remove_semantics](auto& obj) {
+ return obj.unlink_or_remove(std::move(path), remove_semantics);
+ });
+ }
+};
+
+} // namespace seastar::fs
diff --git a/src/fs/metadata_to_disk_buffer.hh b/src/fs/metadata_to_disk_buffer.hh
index 38180224..979a03c2 100644
--- a/src/fs/metadata_to_disk_buffer.hh
+++ b/src/fs/metadata_to_disk_buffer.hh
@@ -186,6 +186,30 @@ class metadata_to_disk_buffer : protected to_disk_buffer {
return APPENDED;
}

+ [[nodiscard]] virtual append_result append(const ondisk_delete_dir_entry_header& delete_dir_entry, const void* entry_name) noexcept {
+ ondisk_type type = DELETE_DIR_ENTRY;
+ if (not fits_for_append(ondisk_entry_size(delete_dir_entry))) {
+ return TOO_BIG;
+ }
+
+ append_bytes(&type, sizeof(type));
+ append_bytes(&delete_dir_entry, sizeof(delete_dir_entry));
+ append_bytes(entry_name, delete_dir_entry.entry_name_length);
+ return APPENDED;
+ }
+
+ [[nodiscard]] virtual append_result append(const ondisk_delete_inode_and_dir_entry_header& delete_inode_and_dir_entry, const void* entry_name) noexcept {
+ ondisk_type type = DELETE_INODE_AND_DIR_ENTRY;
+ if (not fits_for_append(ondisk_entry_size(delete_inode_and_dir_entry))) {
+ return TOO_BIG;
+ }
+
+ append_bytes(&type, sizeof(type));
+ append_bytes(&delete_inode_and_dir_entry, sizeof(delete_inode_and_dir_entry));
+ append_bytes(entry_name, delete_inode_and_dir_entry.entry_name_length);
+ return APPENDED;
+ }
+
using to_disk_buffer::flush_to_disk;
};

diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc
index a8b17c2b..752682e4 100644
--- a/src/fs/metadata_log.cc
+++ b/src/fs/metadata_log.cc
@@ -29,6 +29,7 @@
#include "fs/metadata_log_operations/create_and_open_unlinked_file.hh"
#include "fs/metadata_log_operations/create_file.hh"
#include "fs/metadata_log_operations/link_file.hh"
+#include "fs/metadata_log_operations/unlink_or_remove_file.hh"
#include "fs/metadata_to_disk_buffer.hh"
#include "fs/path.hh"
#include "fs/units.hh"
@@ -128,6 +129,18 @@ void metadata_log::memory_only_add_dir_entry(inode_info::directory& dir, inode_t
++it->second.directories_containing_file;
}

+void metadata_log::memory_only_delete_dir_entry(inode_info::directory& dir, std::string entry_name) {
+ auto it = dir.entries.find(entry_name);
+ assert(it != dir.entries.end());
+
+ auto entry_it = _inodes.find(it->second);
+ assert(entry_it != _inodes.end());
+ assert(entry_it->second.is_linked());
+
+ --entry_it->second.directories_containing_file;
+ dir.entries.erase(it);
+}
+
void metadata_log::schedule_flush_of_curr_cluster() {
// Make writes concurrent (TODO: maybe serialized within *one* cluster would be faster?)
schedule_background_task(do_with(_curr_cluster_buff, &_device, [](auto& crr_clstr_bf, auto& device) {
@@ -307,6 +320,18 @@ future<> metadata_log::link_file(std::string source, std::string destination) {
});
}

+future<> metadata_log::unlink_file(std::string path) {
+ return unlink_or_remove_file_operation::perform(*this, std::move(path), remove_semantics::FILE_ONLY);
+}
+
+future<> metadata_log::remove_directory(std::string path) {
+ return unlink_or_remove_file_operation::perform(*this, std::move(path), remove_semantics::DIR_ONLY);
+}
+
+future<> metadata_log::remove(std::string path) {
+ return unlink_or_remove_file_operation::perform(*this, std::move(path), remove_semantics::FILE_OR_DIR);
+}
+
// TODO: think about how to make filesystem recoverable from ENOSPACE situation: flush() (or something else) throws ENOSPACE,
// then it should be possible to compact some data (e.g. by truncating a file) via top-level interface and retrying the flush()
// without a ENOSPACE error. In particular if we delete all files after ENOSPACE it should be successful. It becomes especially
diff --git a/src/fs/metadata_log_bootstrap.cc b/src/fs/metadata_log_bootstrap.cc
index 64396d11..3120fbd4 100644
--- a/src/fs/metadata_log_bootstrap.cc
+++ b/src/fs/metadata_log_bootstrap.cc
@@ -219,6 +219,10 @@ future<> metadata_log_bootstrap::bootstrap_checkpointed_data() {
return bootstrap_add_dir_entry();
case CREATE_INODE_AS_DIR_ENTRY:
return bootstrap_create_inode_as_dir_entry();
+ case DELETE_DIR_ENTRY:
+ return bootstrap_delete_dir_entry();
+ case DELETE_INODE_AND_DIR_ENTRY:
+ return bootstrap_delete_inode_and_dir_entry();
}

// unknown type => metadata log corruption
@@ -340,6 +344,72 @@ future<> metadata_log_bootstrap::bootstrap_create_inode_as_dir_entry() {
return now();
}

+future<> metadata_log_bootstrap::bootstrap_delete_dir_entry() {
+ ondisk_delete_dir_entry_header entry;
+ if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.dir_inode)) {
+ return invalid_entry_exception();
+ }
+
+ std::string dir_entry_name;
+ if (not _curr_checkpoint.read_string(dir_entry_name, entry.entry_name_length)) {
+ return invalid_entry_exception();
+ }
+
+ if (not _metadata_log._inodes[entry.dir_inode].is_directory()) {
+ return invalid_entry_exception();
+ }
+ auto& dir = _metadata_log._inodes[entry.dir_inode].get_directory();
+
+ auto it = dir.entries.find(dir_entry_name);
+ if (it == dir.entries.end()) {
+ return invalid_entry_exception();
+ }
+
+ _metadata_log.memory_only_delete_dir_entry(dir, std::move(dir_entry_name));
+ // TODO: Maybe mtime_ns for modifying directory?
+ return now();
+}
+
+future<> metadata_log_bootstrap::bootstrap_delete_inode_and_dir_entry() {
+ ondisk_delete_inode_and_dir_entry_header entry;
+ if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.dir_inode) or not inode_exists(entry.inode_to_delete)) {
+ return invalid_entry_exception();
+ }
+
+ std::string dir_entry_name;
+ if (not _curr_checkpoint.read_string(dir_entry_name, entry.entry_name_length)) {
+ return invalid_entry_exception();
+ }
+
+ if (not _metadata_log._inodes[entry.dir_inode].is_directory()) {
+ return invalid_entry_exception();
+ }
+ auto& dir = _metadata_log._inodes[entry.dir_inode].get_directory();
+
+ auto it = dir.entries.find(dir_entry_name);
+ if (it == dir.entries.end()) {
+ return invalid_entry_exception();
+ }
+
+ _metadata_log.memory_only_delete_dir_entry(dir, std::move(dir_entry_name));
+ // TODO: Maybe mtime_ns for modifying directory?
+
+ // TODO: there is so much copy & paste here...
+ // TODO: maybe to make ondisk_delete_inode_and_dir_entry_header have ondisk_delete_inode and
+ // ondisk_delete_dir_entry_header to ease deduplicating code?
+ inode_info& inode_to_delete_info = _metadata_log._inodes.at(entry.inode_to_delete);
+ if (inode_to_delete_info.directories_containing_file > 0) {
+ return invalid_entry_exception(); // Only unlinked inodes may be deleted
+ }
+
+ if (inode_to_delete_info.is_directory() and not inode_to_delete_info.get_directory().entries.empty()) {
+ return invalid_entry_exception(); // Only empty directories may be deleted
+ }
+
+ _metadata_log.memory_only_delete_inode(entry.inode_to_delete);
+ return now();
+}
+
future<> metadata_log_bootstrap::bootstrap(metadata_log& metadata_log, inode_t root_dir, cluster_id_t first_metadata_cluster_id,
cluster_range available_clusters, fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id) {
// Clear the metadata log
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 6259742e..e432e572 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -680,6 +680,7 @@ if (Seastar_EXPERIMENTAL_FS)
src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
src/fs/metadata_log_operations/create_file.hh
src/fs/metadata_log_operations/link_file.hh
+ src/fs/metadata_log_operations/unlink_or_remove_file.hh

Krzysztof Małysa

Apr 20, 2020, 8:02:38 AM4/20/20
to seast...@googlegroups.com, Michał Niciejewski, sa...@scylladb.com, ank...@gmail.com, wmi...@protonmail.com
From: Michał Niciejewski <qup...@gmail.com>

Decreases the opened-file counter. If the file is unlinked and the
counter drops to zero, the file is automatically removed.

Signed-off-by: Michał Niciejewski <qup...@gmail.com>
---
src/fs/metadata_log.hh | 2 ++
src/fs/metadata_log.cc | 27 +++++++++++++++++++++++++++
2 files changed, 29 insertions(+)

diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
index 7bb2c6bc..721e43b8 100644
--- a/src/fs/metadata_log.hh
+++ b/src/fs/metadata_log.hh
@@ -322,6 +322,8 @@ class metadata_log {
// TODO: what about permissions, uid, gid etc.
future<inode_t> open_file(std::string path);

+ future<> close_file(inode_t inode);
+
// All disk-related errors will be exposed here
future<> flush_log() {
return flush_curr_cluster();
diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc
index 00ce88d2..56954cf1 100644
--- a/src/fs/metadata_log.cc
+++ b/src/fs/metadata_log.cc
@@ -354,6 +354,33 @@ future<inode_t> metadata_log::open_file(std::string path) {
});
}

+future<> metadata_log::close_file(inode_t inode) {
+ auto inode_it = _inodes.find(inode);
+ if (inode_it == _inodes.end()) {
+ return make_exception_future(invalid_inode_exception());
+ }
+ inode_info* inode_info = &inode_it->second;
+ if (inode_info->is_directory()) {
+ return make_exception_future(is_directory_exception());
+ }
+
+
+ return _locks.with_lock(metadata_log::locks::shared {inode}, [this, inode, inode_info] {
+ if (not inode_exists(inode)) {
+ return make_exception_future(operation_became_invalid_exception());
+ }
+
+ assert(inode_info->is_open());
+
+ --inode_info->opened_files_count;
+ if (not inode_info->is_linked() and not inode_info->is_open()) {
+ // Unlinked and not open file should be removed
+ schedule_attempt_to_delete_inode(inode);
+ }
+ return now();
+ });
+}
+
// TODO: think about how to make filesystem recoverable from ENOSPACE situation: flush() (or something else) throws ENOSPACE,
// then it should be possible to compact some data (e.g. by truncating a file) via top-level interface and retrying the flush()
// without a ENOSPACE error. In particular if we delete all files after ENOSPACE it should be successful. It becomes especially
--
2.26.1

Krzysztof Małysa

Apr 20, 2020, 8:02:41 AM4/20/20
to seast...@googlegroups.com, Michał Niciejewski, sa...@scylladb.com, ank...@gmail.com, wmi...@protonmail.com
From: Michał Niciejewski <qup...@gmail.com>

Each write can be divided into multiple smaller writes, each falling
into one of the following categories:
- small write: writes below SMALL_WRITE_THRESHOLD bytes; they are
stored fully in memory
- medium write: writes of at least SMALL_WRITE_THRESHOLD and below
cluster_size bytes; they are appended to the on-disk data log, where
data from different writes can share one cluster
- big write: writes of exactly cluster_size bytes; each is stored in
its own on-disk cluster
For example, one write can be divided into multiple big writes, some
small writes, and some medium writes. The current implementation avoids
unnecessary data copying: data supplied by the caller is either written
directly to disk or copied only when stored as a small write.

Added a cluster writer, which is used to perform medium writes. The
cluster writer keeps the current position in the data log and appends
new data by writing it directly to disk.

Signed-off-by: Michał Niciejewski <qup...@gmail.com>
---
src/fs/cluster_writer.hh | 85 +++++++
src/fs/metadata_disk_entries.hh | 41 ++-
src/fs/metadata_log.hh | 16 +-
src/fs/metadata_log_bootstrap.hh | 8 +
src/fs/metadata_log_operations/write.hh | 318 ++++++++++++++++++++++++
src/fs/metadata_to_disk_buffer.hh | 24 ++
src/fs/metadata_log.cc | 67 ++++-
src/fs/metadata_log_bootstrap.cc | 103 ++++++++
CMakeLists.txt | 2 +
9 files changed, 660 insertions(+), 4 deletions(-)
create mode 100644 src/fs/cluster_writer.hh
create mode 100644 src/fs/metadata_log_operations/write.hh

diff --git a/src/fs/cluster_writer.hh b/src/fs/cluster_writer.hh
new file mode 100644
index 00000000..2d2ff917
--- /dev/null
+++ b/src/fs/cluster_writer.hh
@@ -0,0 +1,85 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/bitwise.hh"
+#include "fs/units.hh"
+#include "seastar/core/shared_ptr.hh"
+#include "seastar/fs/block_device.hh"
+
+#include <cstdlib>
+#include <cassert>
+
+namespace seastar::fs {
+
+// Represents buffer that will be written to a block_device. Method init() should be called just after constructor
+// in order to finish construction.
+class cluster_writer {
+protected:
+ size_t _max_size = 0;
+ unit_size_t _alignment = 0;
+ disk_offset_t _cluster_beg_offset = 0;
+ size_t _next_write_offset = 0;
+public:
+ cluster_writer() = default;
+
+ virtual shared_ptr<cluster_writer> virtual_constructor() const {
+ return make_shared<cluster_writer>();
+ }
+
+ // Total number of bytes appended cannot exceed @p aligned_max_size.
+ // @p cluster_beg_offset is the disk offset of the beginning of the cluster.
+ virtual void init(size_t aligned_max_size, unit_size_t alignment, disk_offset_t cluster_beg_offset) {
+ assert(is_power_of_2(alignment));
+ assert(mod_by_power_of_2(aligned_max_size, alignment) == 0);
+ assert(mod_by_power_of_2(cluster_beg_offset, alignment) == 0);
+
+ _max_size = aligned_max_size;
+ _alignment = alignment;
+ _cluster_beg_offset = cluster_beg_offset;
+ _next_write_offset = 0;
+ }
+
+ // Writes @p aligned_buffer to @p device just after previous write (or at @p cluster_beg_offset passed to init()
+ // if it is the first write).
+ virtual future<size_t> write(const void* aligned_buffer, size_t aligned_len, block_device device) {
+ assert(reinterpret_cast<uintptr_t>(aligned_buffer) % _alignment == 0);
+ assert(aligned_len % _alignment == 0);
+ assert(aligned_len <= bytes_left());
+
+ // Make sure the writer is usable before returning from this function
+ size_t curr_write_offset = _next_write_offset;
+ _next_write_offset += aligned_len;
+
+ return device.write(_cluster_beg_offset + curr_write_offset, aligned_buffer, aligned_len);
+ }
+
+ virtual size_t bytes_left() const noexcept { return _max_size - _next_write_offset; }
+
+ // Returns the disk offset at which the first byte of the next write will be placed
+ // TODO: maybe better name for that function? Or any other method to extract that data?
+ virtual disk_offset_t current_disk_offset() const noexcept {
+ return _cluster_beg_offset + _next_write_offset;
+ }
+};
+
+} // namespace seastar::fs
diff --git a/src/fs/metadata_disk_entries.hh b/src/fs/metadata_disk_entries.hh
index 2f363a9b..4422e0b1 100644
--- a/src/fs/metadata_disk_entries.hh
+++ b/src/fs/metadata_disk_entries.hh
@@ -74,6 +74,10 @@ enum ondisk_type : uint8_t {
NEXT_METADATA_CLUSTER,
CREATE_INODE,
DELETE_INODE,
+ SMALL_WRITE,
+ MEDIUM_WRITE,
+ LARGE_WRITE,
+ LARGE_WRITE_WITHOUT_MTIME,
ADD_DIR_ENTRY,
CREATE_INODE_AS_DIR_ENTRY,
DELETE_DIR_ENTRY,
@@ -111,6 +115,35 @@ struct ondisk_delete_inode {
inode_t inode;
} __attribute__((packed));

+struct ondisk_small_write_header {
+ inode_t inode;
+ file_offset_t offset;
+ uint16_t length;
+ decltype(unix_metadata::mtime_ns) time_ns;
+ // After header comes data
+} __attribute__((packed));
+
+struct ondisk_medium_write {
+ inode_t inode;
+ file_offset_t offset;
+ disk_offset_t disk_offset;
+ uint32_t length;
+ decltype(unix_metadata::mtime_ns) time_ns;
+} __attribute__((packed));
+
+struct ondisk_large_write {
+ inode_t inode;
+ file_offset_t offset;
+ cluster_id_t data_cluster; // length == cluster_size
+ decltype(unix_metadata::mtime_ns) time_ns;
+} __attribute__((packed));
+
+struct ondisk_large_write_without_mtime {
+ inode_t inode;
+ file_offset_t offset;
+ cluster_id_t data_cluster; // length == cluster_size
+} __attribute__((packed));
+
struct ondisk_add_dir_entry_header {
inode_t dir_inode;
inode_t entry_inode;
@@ -142,9 +175,15 @@ template<typename T>
constexpr size_t ondisk_entry_size(const T& entry) noexcept {
static_assert(std::is_same_v<T, ondisk_next_metadata_cluster> or
std::is_same_v<T, ondisk_create_inode> or
- std::is_same_v<T, ondisk_delete_inode>, "ondisk entry size not defined for given type");
+ std::is_same_v<T, ondisk_delete_inode> or
+ std::is_same_v<T, ondisk_medium_write> or
+ std::is_same_v<T, ondisk_large_write> or
+ std::is_same_v<T, ondisk_large_write_without_mtime>, "ondisk entry size not defined for given type");
return sizeof(ondisk_type) + sizeof(entry);
}
+constexpr size_t ondisk_entry_size(const ondisk_small_write_header& entry) noexcept {
+ return sizeof(ondisk_type) + sizeof(entry) + entry.length;
+}
constexpr size_t ondisk_entry_size(const ondisk_add_dir_entry_header& entry) noexcept {
return sizeof(ondisk_type) + sizeof(entry) + entry.entry_name_length;
}
diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
index 721e43b8..36e16280 100644
--- a/src/fs/metadata_log.hh
+++ b/src/fs/metadata_log.hh
@@ -23,6 +23,7 @@

#include "fs/cluster.hh"
#include "fs/cluster_allocator.hh"
+#include "fs/cluster_writer.hh"
#include "fs/inode.hh"
#include "fs/inode_info.hh"
#include "fs/metadata_disk_entries.hh"
@@ -54,6 +55,7 @@ class metadata_log {

// Takes care of writing current cluster of serialized metadata log entries to device
shared_ptr<metadata_to_disk_buffer> _curr_cluster_buff;
+ shared_ptr<cluster_writer> _curr_data_writer;
shared_future<> _background_futures = now();

// In memory metadata
@@ -160,10 +162,11 @@ class metadata_log {
friend class create_file_operation;
friend class link_file_operation;
friend class unlink_or_remove_file_operation;
+ friend class write_operation;

public:
metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment,
- shared_ptr<metadata_to_disk_buffer> cluster_buff);
+ shared_ptr<metadata_to_disk_buffer> cluster_buff, shared_ptr<cluster_writer> data_writer);

metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment);

@@ -181,8 +184,16 @@ class metadata_log {
return _inodes.count(inode) != 0;
}

+ void write_update(inode_info::file& file, inode_data_vec new_data_vec);
+
+ // Deletes data vectors that are a subset of @p range and trims overlapping data vectors so they no longer overlap it
+ void cut_out_data_range(inode_info::file& file, file_range range);
+
inode_info& memory_only_create_inode(inode_t inode, bool is_directory, unix_metadata metadata);
void memory_only_delete_inode(inode_t inode);
+ void memory_only_small_write(inode_t inode, file_offset_t file_offset, temporary_buffer<uint8_t> data);
+ void memory_only_disk_write(inode_t inode, file_offset_t file_offset, disk_offset_t disk_offset, size_t write_len);
+ void memory_only_update_mtime(inode_t inode, decltype(unix_metadata::mtime_ns) mtime_ns);
void memory_only_add_dir_entry(inode_info::directory& dir, inode_t entry_inode, std::string entry_name);
void memory_only_delete_dir_entry(inode_info::directory& dir, std::string entry_name);

@@ -324,6 +335,9 @@ class metadata_log {

future<> close_file(inode_t inode);

+ future<size_t> write(inode_t inode, file_offset_t pos, const void* buffer, size_t len,
+ const io_priority_class& pc = default_priority_class());
+
// All disk-related errors will be exposed here
future<> flush_log() {
return flush_curr_cluster();
diff --git a/src/fs/metadata_log_bootstrap.hh b/src/fs/metadata_log_bootstrap.hh
index 16b429ab..03c2eb9b 100644
--- a/src/fs/metadata_log_bootstrap.hh
+++ b/src/fs/metadata_log_bootstrap.hh
@@ -119,6 +119,14 @@ class metadata_log_bootstrap {

future<> bootstrap_delete_inode();

+ future<> bootstrap_small_write();
+
+ future<> bootstrap_medium_write();
+
+ future<> bootstrap_large_write();
+
+ future<> bootstrap_large_write_without_mtime();
+
future<> bootstrap_add_dir_entry();

future<> bootstrap_create_inode_as_dir_entry();
diff --git a/src/fs/metadata_log_operations/write.hh b/src/fs/metadata_log_operations/write.hh
new file mode 100644
index 00000000..afe3e2ae
--- /dev/null
+++ b/src/fs/metadata_log_operations/write.hh
@@ -0,0 +1,318 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/bitwise.hh"
+#include "fs/inode.hh"
+#include "fs/inode_info.hh"
+#include "fs/metadata_disk_entries.hh"
+#include "fs/metadata_log.hh"
+#include "fs/units.hh"
+#include "fs/cluster.hh"
+#include "seastar/core/future-util.hh"
+#include "seastar/core/future.hh"
+#include "seastar/core/shared_ptr.hh"
+#include "seastar/core/temporary_buffer.hh"
+
+namespace seastar::fs {
+
+class write_operation {
+public:
+ // TODO: decide about threshold for small write
+ static constexpr size_t SMALL_WRITE_THRESHOLD = std::numeric_limits<decltype(ondisk_small_write_header::length)>::max();
+
+private:
+ metadata_log& _metadata_log;
+ inode_t _inode;
+ const io_priority_class& _pc;
+
+ write_operation(metadata_log& metadata_log, inode_t inode, const io_priority_class& pc)
+ : _metadata_log(metadata_log), _inode(inode), _pc(pc) {
+ assert(_metadata_log._alignment <= SMALL_WRITE_THRESHOLD and
+ "Small write threshold should be at least as big as alignment");
+ }
+
+ future<size_t> write(const uint8_t* buffer, size_t write_len, file_offset_t file_offset) {
+ auto inode_it = _metadata_log._inodes.find(_inode);
+ if (inode_it == _metadata_log._inodes.end()) {
+ return make_exception_future<size_t>(invalid_inode_exception());
+ }
+ if (inode_it->second.is_directory()) {
+ return make_exception_future<size_t>(is_directory_exception());
+ }
+
+ // TODO: maybe check if there are enough free clusters before executing?
+ return _metadata_log._locks.with_lock(metadata_log::locks::shared {_inode}, [this, buffer, write_len, file_offset] {
+ if (not _metadata_log.inode_exists(_inode)) {
+ return make_exception_future<size_t>(operation_became_invalid_exception());
+ }
+ return iterate_writes(buffer, write_len, file_offset);
+ });
+ }
+
+ future<size_t> iterate_writes(const uint8_t* buffer, size_t write_len, file_offset_t file_offset) {
+ return do_with((size_t)0, [this, buffer, write_len, file_offset](size_t& completed_write_len) {
+ return repeat([this, &completed_write_len, buffer, write_len, file_offset] {
+ if (completed_write_len == write_len) {
+ return make_ready_future<bool_class<stop_iteration_tag>>(stop_iteration::yes);
+ }
+
+ size_t remaining_write_len = write_len - completed_write_len;
+
+ size_t expected_write_len;
+ if (remaining_write_len <= SMALL_WRITE_THRESHOLD) {
+ expected_write_len = remaining_write_len;
+ } else {
+ if (auto buffer_alignment = mod_by_power_of_2(reinterpret_cast<uintptr_t>(buffer) + completed_write_len,
+ _metadata_log._alignment); buffer_alignment != 0) {
+ // When the buffer is not aligned, align it using one small write
+ expected_write_len = _metadata_log._alignment - buffer_alignment;
+ } else {
+ if (remaining_write_len >= _metadata_log._cluster_size) {
+ expected_write_len = _metadata_log._cluster_size;
+ } else {
+ // The remaining write is medium-sized; any unaligned tail is handled by splitting it
+ // into an aligned medium write plus a small write
+ expected_write_len = remaining_write_len;
+ }
+ }
+ }
+
+ auto shifted_buffer = buffer + completed_write_len;
+ auto shifted_file_offset = file_offset + completed_write_len;
+ auto write_future = make_ready_future<size_t>(0);
+ if (expected_write_len <= SMALL_WRITE_THRESHOLD) {
+ write_future = do_small_write(shifted_buffer, expected_write_len, shifted_file_offset);
+ } else if (expected_write_len < _metadata_log._cluster_size) {
+ write_future = medium_write(shifted_buffer, expected_write_len, shifted_file_offset);
+ } else {
+ // Update mtime only when it is the first write
+ write_future = do_large_write(shifted_buffer, shifted_file_offset, completed_write_len == 0);
+ }
+
+ return write_future.then([&completed_write_len, expected_write_len](size_t write_len) {
+ completed_write_len += write_len;
+ if (write_len != expected_write_len) {
+ return stop_iteration::yes;
+ }
+ return stop_iteration::no;
+ });
+ }).then([&completed_write_len] {
+ return make_ready_future<size_t>(completed_write_len);
+ });
+ });
+ }
+
+ static decltype(unix_metadata::mtime_ns) get_current_time_ns() {
+ using namespace std::chrono;
+ return duration_cast<nanoseconds>(system_clock::now().time_since_epoch()).count();
+ }
+
+ future<size_t> do_small_write(const uint8_t* buffer, size_t expected_write_len, file_offset_t file_offset) {
+ auto curr_time_ns = get_current_time_ns();
+ ondisk_small_write_header ondisk_entry {
+ _inode,
+ file_offset,
+ static_cast<decltype(ondisk_small_write_header::length)>(expected_write_len),
+ curr_time_ns
+ };
+
+ switch (_metadata_log.append_ondisk_entry(ondisk_entry, buffer)) {
+ case metadata_log::append_result::TOO_BIG:
+ return make_exception_future<size_t>(cluster_size_too_small_to_perform_operation_exception());
+ case metadata_log::append_result::NO_SPACE:
+ return make_exception_future<size_t>(no_more_space_exception());
+ case metadata_log::append_result::APPENDED:
+ temporary_buffer<uint8_t> tmp_buffer(buffer, expected_write_len);
+ _metadata_log.memory_only_small_write(_inode, file_offset, std::move(tmp_buffer));
+ _metadata_log.memory_only_update_mtime(_inode, curr_time_ns);
+ return make_ready_future<size_t>(expected_write_len);
+ }
+ __builtin_unreachable();
+ }
+
+ future<size_t> medium_write(const uint8_t* aligned_buffer, size_t expected_write_len, file_offset_t file_offset) {
+ assert(reinterpret_cast<uintptr_t>(aligned_buffer) % _metadata_log._alignment == 0);
+ // TODO: medium write can be divided into a larger number of smaller writes. Maybe we should add checks
+ // for that and allow only a limited number of medium writes? Or we could add an option to to_disk_buffer for
+ // space 'reservation' to make sure that after division our write will fit into the buffer?
+ // That would also limit medium write to at most two smaller writes.
+ return do_with((size_t)0, [this, aligned_buffer, expected_write_len, file_offset](size_t& completed_write_len) {
+ return repeat([this, &completed_write_len, aligned_buffer, expected_write_len, file_offset] {
+ if (completed_write_len == expected_write_len) {
+ return make_ready_future<bool_class<stop_iteration_tag>>(stop_iteration::yes);
+ }
+
+ size_t remaining_write_len = expected_write_len - completed_write_len;
+ size_t curr_expected_write_len;
+ auto shifted_buffer = aligned_buffer + completed_write_len;
+ auto shifted_file_offset = file_offset + completed_write_len;
+ auto write_future = make_ready_future<size_t>(0);
+ if (remaining_write_len <= SMALL_WRITE_THRESHOLD) {
+ // We can use small write for the remaining data
+ curr_expected_write_len = remaining_write_len;
+ write_future = do_small_write(shifted_buffer, curr_expected_write_len, shifted_file_offset);
+ } else {
+ size_t rounded_remaining_write_len =
+ round_down_to_multiple_of_power_of_2(remaining_write_len, _metadata_log._alignment);
+
+ // We must use medium write
+ size_t buff_bytes_left = _metadata_log._curr_data_writer->bytes_left();
+ if (buff_bytes_left <= SMALL_WRITE_THRESHOLD) {
+ // TODO: add wasted buff_bytes_left bytes for compaction
+ // No space left in the current to_disk_buffer for medium write - allocate a new buffer
+ std::optional<cluster_id_t> cluster_opt = _metadata_log._cluster_allocator.alloc();
+ if (not cluster_opt) {
+ // TODO: maybe we should return partial write instead of exception?
+ return make_exception_future<bool_class<stop_iteration_tag>>(no_more_space_exception());
+ }
+
+ auto cluster_id = cluster_opt.value();
+ disk_offset_t cluster_disk_offset = cluster_id_to_offset(cluster_id, _metadata_log._cluster_size);
+ _metadata_log._curr_data_writer = _metadata_log._curr_data_writer->virtual_constructor();
+ _metadata_log._curr_data_writer->init(_metadata_log._cluster_size, _metadata_log._alignment,
+ cluster_disk_offset);
+ buff_bytes_left = _metadata_log._curr_data_writer->bytes_left();
+
+ curr_expected_write_len = rounded_remaining_write_len;
+ } else {
+ // There is enough space for medium write
+ curr_expected_write_len = buff_bytes_left >= rounded_remaining_write_len ?
+ rounded_remaining_write_len : buff_bytes_left;
+ }
+
+ write_future = do_medium_write(shifted_buffer, curr_expected_write_len, shifted_file_offset,
+ _metadata_log._curr_data_writer);
+ }
+
+ return write_future.then([&completed_write_len, curr_expected_write_len](size_t write_len) {
+ completed_write_len += write_len;
+ if (write_len != curr_expected_write_len) {
+ return stop_iteration::yes;
+ }
+ return stop_iteration::no;
+ });
+ }).then([&completed_write_len] {
+ return make_ready_future<size_t>(completed_write_len);
+ });
+ });
+ }
+
+ future<size_t> do_medium_write(const uint8_t* aligned_buffer, size_t aligned_expected_write_len, file_offset_t file_offset,
+ shared_ptr<cluster_writer> disk_buffer) {
+ assert(reinterpret_cast<uintptr_t>(aligned_buffer) % _metadata_log._alignment == 0);
+ assert(aligned_expected_write_len % _metadata_log._alignment == 0);
+ assert(disk_buffer->bytes_left() >= aligned_expected_write_len);
+
+ disk_offset_t device_offset = disk_buffer->current_disk_offset();
+ return disk_buffer->write(aligned_buffer, aligned_expected_write_len, _metadata_log._device).then(
+ [this, file_offset, disk_buffer = std::move(disk_buffer), device_offset](size_t write_len) {
+ // TODO: is this round down necessary?
+ // On partial write return aligned write length
+ write_len = round_down_to_multiple_of_power_of_2(write_len, _metadata_log._alignment);
+
+ auto curr_time_ns = get_current_time_ns();
+ ondisk_medium_write ondisk_entry {
+ _inode,
+ file_offset,
+ device_offset,
+ static_cast<decltype(ondisk_medium_write::length)>(write_len),
+ curr_time_ns
+ };
+
+ switch (_metadata_log.append_ondisk_entry(ondisk_entry)) {
+ case metadata_log::append_result::TOO_BIG:
+ return make_exception_future<size_t>(cluster_size_too_small_to_perform_operation_exception());
+ case metadata_log::append_result::NO_SPACE:
+ return make_exception_future<size_t>(no_more_space_exception());
+ case metadata_log::append_result::APPENDED:
+ _metadata_log.memory_only_disk_write(_inode, file_offset, device_offset, write_len);
+ _metadata_log.memory_only_update_mtime(_inode, curr_time_ns);
+ return make_ready_future<size_t>(write_len);
+ }
+ __builtin_unreachable();
+ });
+ }
+
+ future<size_t> do_large_write(const uint8_t* aligned_buffer, file_offset_t file_offset, bool update_mtime) {
+ assert(reinterpret_cast<uintptr_t>(aligned_buffer) % _metadata_log._alignment == 0);
+ // aligned_expected_write_len = _metadata_log._cluster_size
+ std::optional<cluster_id_t> cluster_opt = _metadata_log._cluster_allocator.alloc();
+ if (not cluster_opt) {
+ return make_exception_future<size_t>(no_more_space_exception());
+ }
+ auto cluster_id = cluster_opt.value();
+ disk_offset_t cluster_disk_offset = cluster_id_to_offset(cluster_id, _metadata_log._cluster_size);
+
+ return _metadata_log._device.write(cluster_disk_offset, aligned_buffer, _metadata_log._cluster_size, _pc).then(
+ [this, file_offset, cluster_id, cluster_disk_offset, update_mtime](size_t write_len) {
+ if (write_len != _metadata_log._cluster_size) {
+ _metadata_log._cluster_allocator.free(cluster_id);
+ return make_ready_future<size_t>(0);
+ }
+
+ metadata_log::append_result append_result;
+ if (update_mtime) {
+ auto curr_time_ns = get_current_time_ns();
+ ondisk_large_write ondisk_entry {
+ _inode,
+ file_offset,
+ cluster_id,
+ curr_time_ns
+ };
+ append_result = _metadata_log.append_ondisk_entry(ondisk_entry);
+ if (append_result == metadata_log::append_result::APPENDED) {
+ _metadata_log.memory_only_update_mtime(_inode, curr_time_ns);
+ }
+ } else {
+ ondisk_large_write_without_mtime ondisk_entry {
+ _inode,
+ file_offset,
+ cluster_id
+ };
+ append_result = _metadata_log.append_ondisk_entry(ondisk_entry);
+ }
+
+ switch (append_result) {
+ case metadata_log::append_result::TOO_BIG:
+ return make_exception_future<size_t>(cluster_size_too_small_to_perform_operation_exception());
+ case metadata_log::append_result::NO_SPACE:
+ _metadata_log._cluster_allocator.free(cluster_id);
+ return make_exception_future<size_t>(no_more_space_exception());
+ case metadata_log::append_result::APPENDED:
+ _metadata_log.memory_only_disk_write(_inode, file_offset, cluster_disk_offset, write_len);
+ return make_ready_future<size_t>(write_len);
+ }
+ __builtin_unreachable();
+ });
+ }
+
+public:
+ static future<size_t> perform(metadata_log& metadata_log, inode_t inode, file_offset_t pos, const void* buffer,
+ size_t len, const io_priority_class& pc) {
+ return do_with(write_operation(metadata_log, inode, pc), [buffer, len, pos](auto& obj) {
+ return obj.write(static_cast<const uint8_t*>(buffer), len, pos);
+ });
+ }
+};
+
+} // namespace seastar::fs
diff --git a/src/fs/metadata_to_disk_buffer.hh b/src/fs/metadata_to_disk_buffer.hh
index 979a03c2..6a71d96e 100644
--- a/src/fs/metadata_to_disk_buffer.hh
+++ b/src/fs/metadata_to_disk_buffer.hh
@@ -161,6 +161,30 @@ class metadata_to_disk_buffer : protected to_disk_buffer {
return append_simple(DELETE_INODE, delete_inode);
}

+ [[nodiscard]] virtual append_result append(const ondisk_small_write_header& small_write, const void* data) noexcept {
+ ondisk_type type = SMALL_WRITE;
+ if (not fits_for_append(ondisk_entry_size(small_write))) {
+ return TOO_BIG;
+ }
+
+ append_bytes(&type, sizeof(type));
+ append_bytes(&small_write, sizeof(small_write));
+ append_bytes(data, small_write.length);
+ return APPENDED;
+ }
+
+ [[nodiscard]] virtual append_result append(const ondisk_medium_write& medium_write) noexcept {
+ return append_simple(MEDIUM_WRITE, medium_write);
+ }
+
+ [[nodiscard]] virtual append_result append(const ondisk_large_write& large_write) noexcept {
+ return append_simple(LARGE_WRITE, large_write);
+ }
+
+ [[nodiscard]] virtual append_result append(const ondisk_large_write_without_mtime& large_write_without_mtime) noexcept {
+ return append_simple(LARGE_WRITE_WITHOUT_MTIME, large_write_without_mtime);
+ }
+
[[nodiscard]] virtual append_result append(const ondisk_add_dir_entry_header& add_dir_entry, const void* entry_name) noexcept {
ondisk_type type = ADD_DIR_ENTRY;
if (not fits_for_append(ondisk_entry_size(add_dir_entry))) {
diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc
index 56954cf1..70434a4a 100644
--- a/src/fs/metadata_log.cc
+++ b/src/fs/metadata_log.cc
@@ -30,6 +30,7 @@
#include "fs/metadata_log_operations/create_file.hh"
#include "fs/metadata_log_operations/link_file.hh"
#include "fs/metadata_log_operations/unlink_or_remove_file.hh"
+#include "fs/metadata_log_operations/write.hh"
#include "fs/metadata_to_disk_buffer.hh"
#include "fs/path.hh"
#include "fs/units.hh"
@@ -57,11 +58,12 @@
namespace seastar::fs {

metadata_log::metadata_log(block_device device, uint32_t cluster_size, uint32_t alignment,
- shared_ptr<metadata_to_disk_buffer> cluster_buff)
+ shared_ptr<metadata_to_disk_buffer> cluster_buff, shared_ptr<cluster_writer> data_writer)
: _device(std::move(device))
, _cluster_size(cluster_size)
, _alignment(alignment)
, _curr_cluster_buff(std::move(cluster_buff))
+, _curr_data_writer(std::move(data_writer))
, _cluster_allocator({}, {})
, _inode_allocator(1, 0) {
assert(is_power_of_2(alignment));
@@ -70,7 +72,7 @@ metadata_log::metadata_log(block_device device, uint32_t cluster_size, uint32_t

metadata_log::metadata_log(block_device device, unit_size_t cluster_size, unit_size_t alignment)
: metadata_log(std::move(device), cluster_size, alignment,
- make_shared<metadata_to_disk_buffer>()) {}
+ make_shared<metadata_to_disk_buffer>(), make_shared<cluster_writer>()) {}

future<> metadata_log::bootstrap(inode_t root_dir, cluster_id_t first_metadata_cluster_id, cluster_range available_clusters,
fs_shard_id_t fs_shards_pool_size, fs_shard_id_t fs_shard_id) {
@@ -84,6 +86,27 @@ future<> metadata_log::shutdown() {
});
}

+void metadata_log::write_update(inode_info::file& file, inode_data_vec new_data_vec) {
+ // TODO: for compaction: update used inode_data_vec
+ auto file_size = file.size();
+ if (file_size < new_data_vec.data_range.beg) {
+ file.data.emplace(file_size, inode_data_vec {
+ {file_size, new_data_vec.data_range.beg},
+ inode_data_vec::hole_data {}
+ });
+ } else {
+ cut_out_data_range(file, new_data_vec.data_range);
+ }
+
+ file.data.emplace(new_data_vec.data_range.beg, std::move(new_data_vec));
+}
+
+void metadata_log::cut_out_data_range(inode_info::file& file, file_range range) {
+ file.cut_out_data_range(range, [](inode_data_vec data_vec) {
+ (void)data_vec; // TODO: for compaction: update used inode_data_vec
+ });
+}
+
inode_info& metadata_log::memory_only_create_inode(inode_t inode, bool is_directory, unix_metadata metadata) {
assert(_inodes.count(inode) == 0);
return _inodes.emplace(inode, inode_info {
@@ -118,6 +141,41 @@ void metadata_log::memory_only_delete_inode(inode_t inode) {
_inodes.erase(it);
}

+void metadata_log::memory_only_small_write(inode_t inode, file_offset_t file_offset, temporary_buffer<uint8_t> data) {
+ inode_data_vec data_vec = {
+ {file_offset, file_offset + data.size()},
+ inode_data_vec::in_mem_data {std::move(data)}
+ };
+
+ auto it = _inodes.find(inode);
+ assert(it != _inodes.end());
+ assert(it->second.is_file());
+ write_update(it->second.get_file(), std::move(data_vec));
+}
+
+void metadata_log::memory_only_disk_write(inode_t inode, file_offset_t file_offset, disk_offset_t disk_offset,
+ size_t write_len) {
+ inode_data_vec data_vec = {
+ {file_offset, file_offset + write_len},
+ inode_data_vec::on_disk_data {disk_offset}
+ };
+
+ auto it = _inodes.find(inode);
+ assert(it != _inodes.end());
+ assert(it->second.is_file());
+ write_update(it->second.get_file(), std::move(data_vec));
+}
+
+void metadata_log::memory_only_update_mtime(inode_t inode, decltype(unix_metadata::mtime_ns) mtime_ns) {
+ auto it = _inodes.find(inode);
+ assert(it != _inodes.end());
+ it->second.metadata.mtime_ns = mtime_ns;
+ // ctime should be updated when the contents are modified
+ if (it->second.metadata.ctime_ns < mtime_ns) {
+ it->second.metadata.ctime_ns = mtime_ns;
+ }
+}
+
void metadata_log::memory_only_add_dir_entry(inode_info::directory& dir, inode_t entry_inode, std::string entry_name) {
auto it = _inodes.find(entry_inode);
assert(it != _inodes.end());
@@ -381,6 +439,11 @@ future<> metadata_log::close_file(inode_t inode) {
});
}

+future<size_t> metadata_log::write(inode_t inode, file_offset_t pos, const void* buffer, size_t len,
+ const io_priority_class& pc) {
+ return write_operation::perform(*this, inode, pos, buffer, len, pc);
+}
+
// TODO: think about how to make filesystem recoverable from ENOSPACE situation: flush() (or something else) throws ENOSPACE,
// then it should be possible to compact some data (e.g. by truncating a file) via top-level interface and retrying the flush()
// without a ENOSPACE error. In particular if we delete all files after ENOSPACE it should be successful. It becomes especially
diff --git a/src/fs/metadata_log_bootstrap.cc b/src/fs/metadata_log_bootstrap.cc
index 3120fbd4..52354181 100644
--- a/src/fs/metadata_log_bootstrap.cc
+++ b/src/fs/metadata_log_bootstrap.cc
@@ -111,8 +111,13 @@ future<> metadata_log_bootstrap::bootstrap(cluster_id_t first_metadata_cluster_i
if (free_clusters.empty()) {
return make_exception_future(no_more_space_exception());
}
+ cluster_id_t datalog_cluster_id = free_clusters.front();
free_clusters.pop_front();

+ _metadata_log._curr_data_writer = _metadata_log._curr_data_writer->virtual_constructor();
+ _metadata_log._curr_data_writer->init(_metadata_log._cluster_size, _metadata_log._alignment,
+ cluster_id_to_offset(datalog_cluster_id, _metadata_log._cluster_size));
+
mlogger.debug("free clusters: {}", free_clusters.size());
_metadata_log._cluster_allocator = cluster_allocator(std::move(_taken_clusters), std::move(free_clusters));

@@ -215,6 +220,14 @@ future<> metadata_log_bootstrap::bootstrap_checkpointed_data() {
return bootstrap_create_inode();
case DELETE_INODE:
return bootstrap_delete_inode();
+ case SMALL_WRITE:
+ return bootstrap_small_write();
+ case MEDIUM_WRITE:
+ return bootstrap_medium_write();
+ case LARGE_WRITE:
+ return bootstrap_large_write();
+ case LARGE_WRITE_WITHOUT_MTIME:
+ return bootstrap_large_write_without_mtime();
case ADD_DIR_ENTRY:
return bootstrap_add_dir_entry();
case CREATE_INODE_AS_DIR_ENTRY:
@@ -284,6 +297,96 @@ future<> metadata_log_bootstrap::bootstrap_delete_inode() {
return now();
}

+future<> metadata_log_bootstrap::bootstrap_small_write() {
+ ondisk_small_write_header entry;
+ if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.inode)) {
+ return invalid_entry_exception();
+ }
+
+ if (not _metadata_log._inodes[entry.inode].is_file()) {
+ return invalid_entry_exception();
+ }
+
+ auto data_opt = _curr_checkpoint.read_tmp_buff(entry.length);
+ if (not data_opt) {
+ return invalid_entry_exception();
+ }
+ temporary_buffer<uint8_t>& data = *data_opt;
+
+ _metadata_log.memory_only_small_write(entry.inode, entry.offset, std::move(data));
+ _metadata_log.memory_only_update_mtime(entry.inode, entry.time_ns);
+ return now();
+}
+
+future<> metadata_log_bootstrap::bootstrap_medium_write() {
+ ondisk_medium_write entry;
+ if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.inode)) {
+ return invalid_entry_exception();
+ }
+
+ if (not _metadata_log._inodes[entry.inode].is_file()) {
+ return invalid_entry_exception();
+ }
+
+ cluster_id_t data_cluster_id = offset_to_cluster_id(entry.disk_offset, _metadata_log._cluster_size);
+ if (_available_clusters.beg > data_cluster_id or
+ _available_clusters.end <= data_cluster_id) {
+ return invalid_entry_exception();
+ }
+ // TODO: we could check overlapping with other writes
+ _taken_clusters.emplace(data_cluster_id);
+
+ _metadata_log.memory_only_disk_write(entry.inode, entry.offset, entry.disk_offset, entry.length);
+ _metadata_log.memory_only_update_mtime(entry.inode, entry.time_ns);
+ return now();
+}
+
+future<> metadata_log_bootstrap::bootstrap_large_write() {
+ ondisk_large_write entry;
+ if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.inode)) {
+ return invalid_entry_exception();
+ }
+
+ if (not _metadata_log._inodes[entry.inode].is_file()) {
+ return invalid_entry_exception();
+ }
+
+ if (_available_clusters.beg > entry.data_cluster or
+ _available_clusters.end <= entry.data_cluster or
+ _taken_clusters.count(entry.data_cluster) != 0) {
+ return invalid_entry_exception();
+ }
+ _taken_clusters.emplace((cluster_id_t)entry.data_cluster);
+
+ _metadata_log.memory_only_disk_write(entry.inode, entry.offset,
+ cluster_id_to_offset(entry.data_cluster, _metadata_log._cluster_size), _metadata_log._cluster_size);
+ _metadata_log.memory_only_update_mtime(entry.inode, entry.time_ns);
+ return now();
+}
+
+// TODO: copy pasting :(
+future<> metadata_log_bootstrap::bootstrap_large_write_without_mtime() {
+ ondisk_large_write_without_mtime entry;
+ if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.inode)) {
+ return invalid_entry_exception();
+ }
+
+ if (not _metadata_log._inodes[entry.inode].is_file()) {
+ return invalid_entry_exception();
+ }
+
+ if (_available_clusters.beg > entry.data_cluster or
+ _available_clusters.end <= entry.data_cluster or
+ _taken_clusters.count(entry.data_cluster) != 0) {
+ return invalid_entry_exception();
+ }
+ _taken_clusters.emplace((cluster_id_t)entry.data_cluster);
+
+ _metadata_log.memory_only_disk_write(entry.inode, entry.offset,
+ cluster_id_to_offset(entry.data_cluster, _metadata_log._cluster_size), _metadata_log._cluster_size);
+ return now();
+}
+
future<> metadata_log_bootstrap::bootstrap_add_dir_entry() {
ondisk_add_dir_entry_header entry;
if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.dir_inode) or
diff --git a/CMakeLists.txt b/CMakeLists.txt
index e432e572..840a02aa 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -668,6 +668,7 @@ if (Seastar_EXPERIMENTAL_FS)
src/fs/cluster.hh
src/fs/cluster_allocator.cc
src/fs/cluster_allocator.hh
+ src/fs/cluster_writer.hh
src/fs/crc.hh
src/fs/file.cc
src/fs/inode.hh
@@ -681,6 +682,7 @@ if (Seastar_EXPERIMENTAL_FS)
src/fs/metadata_log_operations/create_file.hh
src/fs/metadata_log_operations/link_file.hh
src/fs/metadata_log_operations/unlink_or_remove_file.hh
+ src/fs/metadata_log_operations/write.hh

Krzysztof Małysa

Apr 20, 2020, 8:02:41 AM4/20/20
to seast...@googlegroups.com, Wojciech Mitros, sa...@scylladb.com, ank...@gmail.com, qup...@gmail.com
From: Wojciech Mitros <wmi...@protonmail.com>

Truncate changes a file's size. When the new size is smaller than the
current one, the data at higher offsets is lost; when it is larger, the
extended region reads as null bytes.

Signed-off-by: Wojciech Mitros <wmi...@protonmail.com>
---
src/fs/metadata_disk_entries.hh | 10 ++-
src/fs/metadata_log.hh | 5 ++
src/fs/metadata_log_bootstrap.hh | 2 +
src/fs/metadata_log_operations/truncate.hh | 90 ++++++++++++++++++++++
src/fs/metadata_to_disk_buffer.hh | 4 +
src/fs/metadata_log.cc | 26 +++++++
src/fs/metadata_log_bootstrap.cc | 17 ++++
CMakeLists.txt | 1 +
8 files changed, 154 insertions(+), 1 deletion(-)
create mode 100644 src/fs/metadata_log_operations/truncate.hh

diff --git a/src/fs/metadata_disk_entries.hh b/src/fs/metadata_disk_entries.hh
index 4422e0b1..8c9f0499 100644
--- a/src/fs/metadata_disk_entries.hh
+++ b/src/fs/metadata_disk_entries.hh
@@ -78,6 +78,7 @@ enum ondisk_type : uint8_t {
MEDIUM_WRITE,
LARGE_WRITE,
LARGE_WRITE_WITHOUT_MTIME,
+ TRUNCATE,
ADD_DIR_ENTRY,
CREATE_INODE_AS_DIR_ENTRY,
DELETE_DIR_ENTRY,
@@ -144,6 +145,12 @@ struct ondisk_large_write_without_mtime {
cluster_id_t data_cluster; // length == cluster_size
} __attribute__((packed));

+struct ondisk_truncate {
+ inode_t inode;
+ file_offset_t size;
+ decltype(unix_metadata::mtime_ns) time_ns;
+} __attribute__((packed));
+
struct ondisk_add_dir_entry_header {
inode_t dir_inode;
inode_t entry_inode;
@@ -178,7 +185,8 @@ constexpr size_t ondisk_entry_size(const T& entry) noexcept {
std::is_same_v<T, ondisk_delete_inode> or
std::is_same_v<T, ondisk_medium_write> or
std::is_same_v<T, ondisk_large_write> or
- std::is_same_v<T, ondisk_large_write_without_mtime>, "ondisk entry size not defined for given type");
+ std::is_same_v<T, ondisk_large_write_without_mtime> or
+ std::is_same_v<T, ondisk_truncate>, "ondisk entry size not defined for given type");
return sizeof(ondisk_type) + sizeof(entry);
}
constexpr size_t ondisk_entry_size(const ondisk_small_write_header& entry) noexcept {
diff --git a/src/fs/metadata_log.hh b/src/fs/metadata_log.hh
index 36e16280..1ee29842 100644
--- a/src/fs/metadata_log.hh
+++ b/src/fs/metadata_log.hh
@@ -161,6 +161,7 @@ class metadata_log {
friend class create_and_open_unlinked_file_operation;
friend class create_file_operation;
friend class link_file_operation;
+ friend class truncate_operation;
friend class unlink_or_remove_file_operation;
friend class write_operation;

@@ -194,6 +195,7 @@ class metadata_log {
void memory_only_small_write(inode_t inode, file_offset_t file_offset, temporary_buffer<uint8_t> data);
void memory_only_disk_write(inode_t inode, file_offset_t file_offset, disk_offset_t disk_offset, size_t write_len);
void memory_only_update_mtime(inode_t inode, decltype(unix_metadata::mtime_ns) mtime_ns);
+ void memory_only_truncate(inode_t inode, disk_offset_t size);
void memory_only_add_dir_entry(inode_info::directory& dir, inode_t entry_inode, std::string entry_name);
void memory_only_delete_dir_entry(inode_info::directory& dir, std::string entry_name);

@@ -338,6 +340,9 @@ class metadata_log {
future<size_t> write(inode_t inode, file_offset_t pos, const void* buffer, size_t len,
const io_priority_class& pc = default_priority_class());

+ // Truncates a file or extends it with a "hole" data_vec to the specified size
+ future<> truncate(inode_t inode, file_offset_t size);
+
// All disk-related errors will be exposed here
future<> flush_log() {
return flush_curr_cluster();
diff --git a/src/fs/metadata_log_bootstrap.hh b/src/fs/metadata_log_bootstrap.hh
index 03c2eb9b..5c3584da 100644
--- a/src/fs/metadata_log_bootstrap.hh
+++ b/src/fs/metadata_log_bootstrap.hh
@@ -127,6 +127,8 @@ class metadata_log_bootstrap {

future<> bootstrap_large_write_without_mtime();

+ future<> bootstrap_truncate();
+
future<> bootstrap_add_dir_entry();

future<> bootstrap_create_inode_as_dir_entry();
diff --git a/src/fs/metadata_log_operations/truncate.hh b/src/fs/metadata_log_operations/truncate.hh
new file mode 100644
index 00000000..abd7d158
--- /dev/null
+++ b/src/fs/metadata_log_operations/truncate.hh
@@ -0,0 +1,90 @@
+/*
+ * This file is open source software, licensed to you under the terms
+ * of the Apache License, Version 2.0 (the "License"). See the NOTICE file
+ * distributed with this work for additional information regarding copyright
+ * ownership. You may not use this file except in compliance with the License.
+ *
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+/*
+ * Copyright (C) 2020 ScyllaDB
+ */
+
+#pragma once
+
+#include "fs/inode.hh"
+#include "fs/inode_info.hh"
+#include "fs/metadata_disk_entries.hh"
+#include "fs/metadata_log.hh"
+#include "fs/units.hh"
+#include "seastar/core/future-util.hh"
+#include "seastar/core/future.hh"
+
+namespace seastar::fs {
+
+class truncate_operation {
+
+ metadata_log& _metadata_log;
+ inode_t _inode;
+
+ truncate_operation(metadata_log& metadata_log, inode_t inode)
+ : _metadata_log(metadata_log), _inode(inode) {
+ }
+
+ future<> truncate(file_offset_t size) {
+ auto inode_it = _metadata_log._inodes.find(_inode);
+ if (inode_it == _metadata_log._inodes.end()) {
+ return make_exception_future(invalid_inode_exception());
+ }
+ if (inode_it->second.is_directory()) {
+ return make_exception_future(is_directory_exception());
+ }
+
+ return _metadata_log._locks.with_lock(metadata_log::locks::shared {_inode}, [this, size] {
+ if (not _metadata_log.inode_exists(_inode)) {
+ return make_exception_future(operation_became_invalid_exception());
+ }
+ return do_truncate(size);
+ });
+ }
+
+ future<> do_truncate(file_offset_t size) {
+ using namespace std::chrono;
+ uint64_t curr_time_ns = duration_cast<nanoseconds>(system_clock::now().time_since_epoch()).count();
+ ondisk_truncate ondisk_entry {
+ _inode,
+ size,
+ curr_time_ns
+ };
+
+ switch (_metadata_log.append_ondisk_entry(ondisk_entry)) {
+ case metadata_log::append_result::TOO_BIG:
+ return make_exception_future(cluster_size_too_small_to_perform_operation_exception());
+ case metadata_log::append_result::NO_SPACE:
+ return make_exception_future(no_more_space_exception());
+ case metadata_log::append_result::APPENDED:
+ _metadata_log.memory_only_truncate(_inode, size);
+ _metadata_log.memory_only_update_mtime(_inode, curr_time_ns);
+ return make_ready_future();
+ }
+ __builtin_unreachable();
+ }
+
+public:
+ static future<> perform(metadata_log& metadata_log, inode_t inode, file_offset_t size) {
+ return do_with(truncate_operation(metadata_log, inode), [size](auto& obj) {
+ return obj.truncate(size);
+ });
+ }
+};
+
+} // namespace seastar::fs
diff --git a/src/fs/metadata_to_disk_buffer.hh b/src/fs/metadata_to_disk_buffer.hh
index 6a71d96e..714d9f59 100644
--- a/src/fs/metadata_to_disk_buffer.hh
+++ b/src/fs/metadata_to_disk_buffer.hh
@@ -185,6 +185,10 @@ class metadata_to_disk_buffer : protected to_disk_buffer {
return append_simple(LARGE_WRITE_WITHOUT_MTIME, large_write_without_mtime);
}

+ [[nodiscard]] virtual append_result append(const ondisk_truncate& truncate) noexcept {
+ return append_simple(TRUNCATE, truncate);
+ }
+
[[nodiscard]] virtual append_result append(const ondisk_add_dir_entry_header& add_dir_entry, const void* entry_name) noexcept {
ondisk_type type = ADD_DIR_ENTRY;
if (not fits_for_append(ondisk_entry_size(add_dir_entry))) {
diff --git a/src/fs/metadata_log.cc b/src/fs/metadata_log.cc
index 70434a4a..18f52dfc 100644
--- a/src/fs/metadata_log.cc
+++ b/src/fs/metadata_log.cc
@@ -29,6 +29,7 @@
#include "fs/metadata_log_operations/create_and_open_unlinked_file.hh"
#include "fs/metadata_log_operations/create_file.hh"
#include "fs/metadata_log_operations/link_file.hh"
+#include "fs/metadata_log_operations/truncate.hh"
#include "fs/metadata_log_operations/unlink_or_remove_file.hh"
#include "fs/metadata_log_operations/write.hh"
#include "fs/metadata_to_disk_buffer.hh"
@@ -176,6 +177,27 @@ void metadata_log::memory_only_update_mtime(inode_t inode, decltype(unix_metadat
}
}

+void metadata_log::memory_only_truncate(inode_t inode, file_offset_t size) {
+ auto it = _inodes.find(inode);
+ assert(it != _inodes.end());
+ assert(it->second.is_file());
+ auto& file = it->second.get_file();
+
+ auto file_size = file.size();
+ if (size > file_size) {
+ file.data.emplace(file_size, inode_data_vec {
+ {file_size, size},
+ inode_data_vec::hole_data {}
+ });
+ } else {
+ // TODO: for compaction: update used inode_data_vec
+ cut_out_data_range(file, {
+ size,
+ std::numeric_limits<decltype(file_range::end)>::max()
+ });
+ }
+}
+
void metadata_log::memory_only_add_dir_entry(inode_info::directory& dir, inode_t entry_inode, std::string entry_name) {
auto it = _inodes.find(entry_inode);
assert(it != _inodes.end());
@@ -444,6 +466,10 @@ future<size_t> metadata_log::write(inode_t inode, file_offset_t pos, const void*
return write_operation::perform(*this, inode, pos, buffer, len, pc);
}

+future<> metadata_log::truncate(inode_t inode, file_offset_t size) {
+ return truncate_operation::perform(*this, inode, size);
+}
+
// TODO: think about how to make filesystem recoverable from ENOSPACE situation: flush() (or something else) throws ENOSPACE,
// then it should be possible to compact some data (e.g. by truncating a file) via top-level interface and retrying the flush()
// without a ENOSPACE error. In particular if we delete all files after ENOSPACE it should be successful. It becomes especially
diff --git a/src/fs/metadata_log_bootstrap.cc b/src/fs/metadata_log_bootstrap.cc
index 52354181..5e3b74e4 100644
--- a/src/fs/metadata_log_bootstrap.cc
+++ b/src/fs/metadata_log_bootstrap.cc
@@ -228,6 +228,8 @@ future<> metadata_log_bootstrap::bootstrap_checkpointed_data() {
return bootstrap_large_write();
case LARGE_WRITE_WITHOUT_MTIME:
return bootstrap_large_write_without_mtime();
+ case TRUNCATE:
+ return bootstrap_truncate();
case ADD_DIR_ENTRY:
return bootstrap_add_dir_entry();
case CREATE_INODE_AS_DIR_ENTRY:
@@ -387,6 +389,21 @@ future<> metadata_log_bootstrap::bootstrap_large_write_without_mtime() {
return now();
}

+future<> metadata_log_bootstrap::bootstrap_truncate() {
+ ondisk_truncate entry;
+ if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.inode)) {
+ return invalid_entry_exception();
+ }
+
+ if (not _metadata_log._inodes[entry.inode].is_file()) {
+ return invalid_entry_exception();
+ }
+
+ _metadata_log.memory_only_truncate(entry.inode, entry.size);
+ _metadata_log.memory_only_update_mtime(entry.inode, entry.time_ns);
+ return now();
+}
+
future<> metadata_log_bootstrap::bootstrap_add_dir_entry() {
ondisk_add_dir_entry_header entry;
if (not _curr_checkpoint.read_entry(entry) or not inode_exists(entry.dir_inode) or
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 840a02aa..b6c8ef3a 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -681,6 +681,7 @@ if (Seastar_EXPERIMENTAL_FS)
src/fs/metadata_log_operations/create_and_open_unlinked_file.hh
src/fs/metadata_log_operations/create_file.hh
src/fs/metadata_log_operations/link_file.hh
+ src/fs/metadata_log_operations/truncate.hh
src/fs/metadata_log_operations/unlink_or_remove_file.hh
src/fs/metadata_log_operations/write.hh
src/fs/metadata_to_disk_buffer.hh
--
2.26.1