[RELEASE] Scylla 4.5 RC1


Tzach Livyatan

<tzach@scylladb.com>

Apr 29, 2021, 3:33:38 PM

to ScyllaDB users, scylladb-dev

The Scylla team is pleased to announce Scylla Open Source 4.5 RC1, the first Release Candidate for the Scylla Open Source 4.5 minor release. Moving forward, we will only fix critical bugs in branch-4.5. We will continue to fix bugs and add features to the master branch.

Scylla 4.5 includes a new restore (load and stream) operation, Alternator support for CORS, and many other performance and stability improvements and bug fixes (below). Find the Scylla Open Source 4.5 repository for your Linux distribution here. Scylla 4.5 RC1 Docker is also available.

Use the release candidate with caution; RC1 is not production-ready yet. You can help stabilize Scylla Open Source 4.5 by reporting bugs here.

Only the last two minor releases of the Scylla Open Source project are supported. Once Scylla Open Source 4.5 is officially released, only Scylla Open Source 4.5 and Scylla 4.4 will be supported, and Scylla 4.3 will be retired.

New features in Scylla 4.5

Load and stream

This feature extends nodetool refresh to allow loading arbitrary sstables that do not belong to a node into the cluster. It loads the sstables from disk, calculates which nodes own the data, and streams the data to them automatically.

For example, say the old cluster has 6 nodes and the new cluster has 3 nodes. We can copy the sstables from the old cluster to any of the new nodes and trigger the load and stream process.

This can make restores and migrations much easier:

You can place sstables from any node on any node
No need to run nodetool cleanup to remove unused data

curl -X POST "http://{ip}:10000/storage_service/sstables/{keyspace}?cf={table}&load_and_stream=true"
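The same request can be issued from a script. The following is a minimal sketch using only the Python standard library; the helper names and the example address, keyspace, and table are hypothetical, and only the URL shape is taken from the curl command above.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def load_and_stream_url(ip, keyspace, table, port=10000):
    """Build the REST URL that triggers load and stream for one table."""
    query = urlencode({"cf": table, "load_and_stream": "true"})
    return f"http://{ip}:{port}/storage_service/sstables/{quote(keyspace, safe='')}?{query}"


def trigger_load_and_stream(ip, keyspace, table, timeout=None):
    """Send the POST request to the node and return the HTTP status code."""
    request = Request(load_and_stream_url(ip, keyspace, table), method="POST")
    with urlopen(request, timeout=timeout) as response:
        return response.status
```

For example, trigger_load_and_stream("127.0.0.1", "ks", "events") would post to a node on localhost after the sstables have been copied into that table's upload directory, as with a regular nodetool refresh.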

Alternator

Support for Cross-Origin Resource Sharing (CORS). This allows client browsers to access the database directly via JavaScript, avoiding the middle tier. #8025
Support limiting the number of concurrent requests with a scylla.yaml configuration value, max_concurrent_requests_per_shard. If the limit is exceeded, Alternator returns a RequestLimitExceeded error (compatible with the DynamoDB API). #7294
Alternator, Scylla's implementation of a DynamoDB-compatible API, now fully supports nested attribute paths. Nested attribute processing happens when an item's attribute is itself an object, and an operation modifies just one of the object's attributes instead of the entire object. #5024 #8043
Alternator now supports slow query logging capability. Queries that last longer than the specified threshold are logged in `system_traces.node_slow_log` and traced. #8292
Alternator was changed to avoid large contiguous allocations for large requests. Instead, the allocation will be broken up into smaller chunks. This reduces stress on the allocator, and therefore latency. #7213

sstableloader now works with Alternator tables #8229
Support attribute paths in ConditionExpression, FilterExpression
Support attribute paths in ProjectionExpression
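To illustrate what a nested attribute path looks like on the wire, here is a minimal sketch that builds the JSON body of a DynamoDB-style UpdateItem request modifying a single nested attribute. The function name, table name, and attribute names are hypothetical; the request fields are those of the DynamoDB API that Alternator implements.

```python
import json


def nested_update_request(table, key, path, value):
    """Build an UpdateItem request body that sets one nested attribute."""
    # Each path component gets a #placeholder so reserved words are safe.
    parts = path.split(".")
    names = {f"#p{i}": part for i, part in enumerate(parts)}
    return {
        "TableName": table,
        "Key": key,
        # Joining the placeholders yields a path such as "#p0.#p1.#p2".
        "UpdateExpression": "SET " + ".".join(names) + " = :v",
        "ExpressionAttributeNames": names,
        "ExpressionAttributeValues": {":v": value},
    }


if __name__ == "__main__":
    body = nested_update_request(
        "users", {"id": {"S": "u1"}}, "profile.address.city", {"S": "Paris"}
    )
    print(json.dumps(body, indent=2))
```

Only profile.address.city is rewritten by such a request; the rest of the profile object is left untouched.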

Raft

We are building up an internal service in Scylla, useful for this and other applications. The changes have no visible effect yet. Among other changes, the following were added:

Scylla stores the database schema in a set of tables. Previously, these tables were sharded across all cores like ordinary user tables. They are now maintained by shard 0 alone. This is a step towards letting Raft manage them, since Raft needs to atomically modify the schema tables, and this can't be done if the data is distributed on many cores. #7947
Raft can now store its log data in a system table. Raft is implemented in a modular fashion with plug-ins implementing various parts; this is a persistence module.
Raft Joint Consensus has been merged. This is the ability to change a Raft group from one set of nodes to another, needed to change cluster topology or to migrate data to different nodes.
Raft now integrates with the Scylla RPC subsystem; Raft itself is modular and requires integration with the various Scylla service providers.
The Raft implementation gained support for non-voting nodes. This is used to make membership changes less disruptive.
The Raft implementation now has a per-server timer, used for Raft keepalives.
The Raft implementation gained support for leader step down. This improves availability when a node is taken down for planned maintenance.
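As an illustration of the joint consensus rule mentioned above, the sketch below shows the textbook Raft condition: while a group is moving from an old configuration to a new one, a decision needs a majority in both. This is a toy model under that assumption, not Scylla's implementation; the function and node names are hypothetical. Non-voting nodes are modeled simply by leaving them out of the member sets.

```python
def has_majority(acks, members):
    """True if strictly more than half of `members` acknowledged."""
    return len(acks & members) > len(members) // 2


def joint_quorum(acks, old_config, new_config):
    """During joint consensus, agreement needs a majority of BOTH configurations."""
    return has_majority(acks, old_config) and has_majority(acks, new_config)


if __name__ == "__main__":
    old = {"a", "b", "c"}
    new = {"c", "d", "e"}
    # All old voters agree, but only one new voter does: not enough.
    print(joint_quorum({"a", "b", "c"}, old, new))
    # Two voters from each configuration agree: quorum reached.
    print(joint_quorum({"a", "c", "d"}, old, new))
```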

Deployment and packaging

It is now possible to install Scylla on SUSE Linux Enterprise Server using the standard RPM packages. #8277
The setup utility now uses chrony instead of ntp for timekeeping on all Linux distributions. This makes setup more regular. #7922
aio-max-nr is now set dynamically based on the number of CPUs. This is mostly needed for large machines such as EC2 i3en.24xlarge. #8133

CDC

The Change Data Capture (CDC) facility used a collection to store information about CDC log streams. Since large clusters can have many streams, this violation of Scylla guidelines caused many latency problems. Two steps were taken to correct it: the number of streams was limited (with some loss in efficiency on large clusters), and a new format was introduced (with automatic transition code) that uses partitions and clustering rows instead of collections. #7993

Tools and APIs

It is now possible to perform a partial repair when a node is missing, by using the new ignore_nodes option. Repair will also detect when a repair range has no live nodes to repair with and short-circuit the operation. #7806 #8256
The Thrift API is now disabled by default. Because it is used less often, users might not be aware that Thrift is open, which might be a security risk. #8336. To enable it, add “start_rpc: true” to scylla.yaml. In addition, Thrift now has:

partial admission control
support for max_concurrent_requests_per_shard
counters for in-flight requests and blocked requests

Nodetool Top Partitions extension. nodetool toppartitions allows you to find the partitions with the highest read and write access in the last time window. Until now, nodetool toppartitions supported only one table at a time. From Scylla 4.5, nodetool toppartitions allows specifying a list of tables or keyspaces. #4520

nodetool stop now supports all compaction types. Supported types are: COMPACTION, CLEANUP, VALIDATION, SCRUB, INDEX_BUILD, RESHARD, UPGRADE, RESHAPE. For example: nodetool stop SCRUB. Note that reshard and reshape start automatically on boot or refresh, if needed. Compaction, Cleanup, Scrub, and Upgrade are started with a nodetool command. The others (VALIDATION, INDEX_BUILD) are unsupported by nodetool stop.
scylla_setup option to retry the RAID setup #8174
New system/drop_sstable_caches RESTful API. Evicts objects from caches that reflect sstable content, like the row cache. In the future, it will also drop the page cache and sstable index caches. While the existing BYPASS CACHE affects the behavior of a given CQL query on a per-query basis, this API clears the cache at the time of invocation; later queries will repopulate it.
REST API: add the compaction id to the response of GET compaction_manager/compactions

Lightweight (fast) slow query logging mode. This is a new, low-overhead mode for slow query tracing. When enabled, it works the same way as slow query tracing, except that it omits recording tracing events. It therefore does not populate the system_traces.events table, but it still populates trace session records for slow queries in the other tables: system_traces.sessions, system_traces.node_slow_log, etc. #2572. More here
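The nodetool stop usage described above can be wrapped in a small script. The sketch below is a minimal example assuming nodetool is on the PATH; the helper names are hypothetical. It accepts only the types the note above does not list as unsupported, so VALIDATION and INDEX_BUILD are rejected.

```python
import subprocess

# Types from the list above, minus VALIDATION and INDEX_BUILD,
# which the note names as unsupported by nodetool stop.
STOPPABLE_TYPES = ("COMPACTION", "CLEANUP", "SCRUB", "RESHARD", "UPGRADE", "RESHAPE")


def nodetool_stop_command(compaction_type):
    """Return the argv list for `nodetool stop <TYPE>`, validating the type."""
    kind = compaction_type.upper()
    if kind not in STOPPABLE_TYPES:
        raise ValueError(f"unsupported compaction type: {compaction_type}")
    return ["nodetool", "stop", kind]


def stop_compaction(compaction_type):
    """Run nodetool stop for the given type; raises if nodetool fails."""
    return subprocess.run(nodetool_stop_command(compaction_type), check=True)
```

For example, stop_compaction("scrub") runs the same command as nodetool stop SCRUB.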

Performance Optimizations

Improve flat_mutation_reader::consume_pausable #8359. Combined reader microbenchmark has shown from 2% to 22% improvement in median execution time while memtable microbenchmark has shown from 3.6% to 7.8% improvement in median execution time.
Fix significant write amplification when reshaping level 0 in an LCS table #8345
The Log-Structured Allocator (LSA) is the underlying memory allocator behind Scylla's cache and memtables. When memory runs out, it is called to evict objects from cache, and to defragment free memory, in order to serve new allocation requests. If memory was especially fragmented, or if the allocation request was large, this could take a long while, causing a latency spike. To combat this, a new background reclaim service is added which evicts and defragments memory ahead of time and maintains a watermark of free, non-fragmented memory from which allocations can be satisfied quickly. This is somewhat similar to kswapd on Linux. #1634
To store cells in rows, Scylla used a combination of a vector (for short rows) and red-black tree (for wide rows), switching between the representations dynamically. The red-black tree is inefficient in memory footprint when many cells are present, so the data storage now uses a radix tree exclusively. This reduces the memory footprint and improves efficiency.
sstables: Share partition index pages between readers. Before this patch, each index reader had its own cache of partition index pages. Now there is a shared cache, owned by the sstable object. This allows concurrent reads to share partition index pages and thus reduce the amount of I/O. For I/O-bound workloads, each read needed 2 I/Os before and needs 1 (amortized) now. The throughput is ~70% higher. More
Switch partition rows onto B-tree. The data type for storing rows inside a partition was changed from a red-black tree to a B-tree. This saves space and spares some CPU cycles. More here.
The sstable reader will now allow preemption at row granularity; previously, sstables containing many small rows could cause small latency spikes as the reader would only preempt when an 8k buffer was filled. #7883
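The effect of sharing partition index pages between readers can be shown with a toy model. The sketch below is only an illustration of the idea, not Scylla's code: the class names are hypothetical, disk reads are simulated with a counter, and there is no eviction.

```python
class SSTable:
    """Toy sstable that owns one cache of partition index pages."""

    def __init__(self):
        self.page_cache = {}  # page number -> page contents, shared by all readers
        self.io_count = 0     # number of simulated disk reads

    def read_page_from_disk(self, page_no):
        self.io_count += 1
        return f"index-page-{page_no}"


class IndexReader:
    """Reader that looks up index pages through the sstable's shared cache."""

    def __init__(self, sstable):
        self.sstable = sstable

    def get_page(self, page_no):
        cache = self.sstable.page_cache
        if page_no not in cache:
            # Only the first reader to need a page pays for the disk read.
            cache[page_no] = self.sstable.read_page_from_disk(page_no)
        return cache[page_no]


if __name__ == "__main__":
    sst = SSTable()
    first, second = IndexReader(sst), IndexReader(sst)
    first.get_page(7)
    second.get_page(7)
    print(sst.io_count)  # one disk read served both readers
```

With a per-reader cache, the second reader would have triggered its own disk read for the same page.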

Repair Based Node Operations (experimental)

Repair Based Node Operations was introduced as an experimental feature in Scylla 4.0, intending to use the same underlying implementation for both streaming and repair. While still considered experimental, we continue to work on this feature.

Repair is oriented towards moving small amounts of data, not an entire node's worth. This resulted in many sstables being created in the node, creating a large compaction load. To fix that, offstrategy compaction is now used to compact these sstables efficiently without impacting the main workload. #5226

To enable repair based node operations, add the following to scylla.yaml:

enable_repair_based_node_ops: true

Configuration

enable_sstables_mc_format is now ignored: users can no longer disable the MC format in favor of older SSTable formats.

Other bugs fixed in this release

Stability: Optimized TWCS single-partition reader opens sstables unnecessarily #8432
Stability: TimeWindowCompactionStrategy not using specialized reader for single partition queries #8415
Stability: Scylla will exit when accessed with a LOCAL_QUORUM to a DC with zero replication (one can define different numbers of replication per DC). #8354
Tools: sstableloader: partition with old deletion and new data handled incorrectly #8390
Stability: Commitlog pre-fill inner loop condition broken #8369
aws: aws_instance.ebs_disks() causes traceback when no EBS disks #8365
thrift: handle gate closed exception on retry #8337
Stability: missing dead row marker for KA/LA file format #8324. Note that the KA/LA sstable formats are legacy formats that are not used in latest Scylla versions.
inactive readers unification caused lsa OOM in toppartitions_test #8258
Thrift: too many accept attempts end up in segmentation fault #8317
Stability: Failed SELECT with tuple of reversed-ordered frozen collections #7902
Stability: Certain combination of filtering, index, and frozen collection, causes "marshalling error" failure #7888
Build: tools/toolchain: install-dependencies.sh causes an error while building the Docker image, and the error is ignored #8293
Stability: Use-after-free in simple_repair_test #8274
Monitoring: storage_proxy counters are not updated on cql counter operations #4337
Security: Enforce dc/rack membership iff required for non-tls connections #8051
Stability: Scylla tries to keep enough free memory ahead of allocation, so that allocations don't stall. The amount of CPU power devoted to background reclaim is supposed to self-tune with memory demand, but this wasn't working correctly. #8234
Nodetool cleanup failed because of "DC or rack not found in snitch properties" #7930
Stability: a possible race condition in MV/SI schema creation and load may cause inconsistency between base table and view table #7709
Thrift: Regression in thrift_tests.test_get_range_slice dtest: query_data_on_all_shards(): reverse range scans are not supported #8211
Stability: mutation_test: fatal error: in "test_apply_monotonically_is_monotonic": Mutations differ #8154
Stability: Node was overloaded: Too many in flight hints during Enospc nemesis #8137
Stability: Make untyped_result_set non-copying and retain fragments #8014
Stability: Requests are not entirely read during shedding, which leads to invalidating the connection once shedding happens. Shedding is the process of dropping requests to protect the system, for example, if they are too large or exceeding the max number of concurrent requests per shard. #8193
Stability: Versioned sstable_set #2622
UX: Improve the verbosity of errors coming from the view builder/updater #8177
Tools: Incorrect output in nodetool compactionstats #7927
Stability: cache-bypassing single-partition query from TWCS table not showing a row (but it appears in range scans). Introduced after Scylla 4.4 #8138
CQL: unpaged query is terminated silently if it reaches global limit first. The bug was introduced in Scylla 4.3 #8162
Stability: The multishard combining reader is responsible for merging data from multiple cores when a range scan runs. A bug that is triggered by very small token ranges (e.g. 1 token) caused shards that have no data to contribute to be queried, increasing read amplification. #8161
Stability: Repairing a table with TWCS can cause a high number of parallel compactions #8124
Stability: Run init_server and join_cluster inside maintenance scheduling group #8130
Install: scylla_create_devices fails on EC2 with subprocess.CalledProcessError: Command /opt/scylladb/scripts/scylla_raid_setup... returned non-zero exit status 1 #8055
Stability: cdc: log: use-after-free in process_bytes_visitor #8117
Stability: Repair task from manager failed due to a core dump on one of the nodes #8059
CQL: NetworkTopologyStrategy data center options are not validated #7595
Stability: no local limit for non-limited queries in mixed cluster may cause repair to fail #8022
Debug: Make scylla backtraces always print on one line #5464
Init: perftune.py fails with TypeError: 'NoneType' object is not iterable #8008
Stability: using experimental UDF can lead to exit #7977
Stability: Make commitlog accept N mutations in bulk #7615
Stability: transport: Fix abort on certain configurations of native_transport_port(_ssl) #7866 #7783
Debug: add sstable origin information to scylla metadata component #7880
Install: dist/offline_installer/redhat: causes "scylla does not work with current umask setting (0077)" #6243
Alternator: nodetool cannot work on table with a dot in its name #6521
Stability: During replace node operation - replacing node is used to respond to read queries #7312
Install: Scylla doesn't use /etc/security/limits.d/scylla.conf #7925
Stability: multishard_combining_reader uses smp::count in one place instead of _sharder.shard_count() #7945
Stability: Failed fromJson() should result in FunctionFailure error, not an internal error #7911
Stability: List append uses the wrong timestamp with LWT #7611
Stability: currentTimeUUID creates duplicates when called at the same point in time #6208
Build: dbuild fails with an error on older kernels (without cgroupsv2) #7938
Stability: Error: "seastar - Exceptional future ignored: sstables::compaction_stop_exception" after node drain #7904
UX: Scylla reports broken pipe and connection reset by peer errors from the native transport, although they can happen in normal operation. #7907
Redis: Redis 'exists' command fails with lots of keys #7273
Stability: A mistake in Time Window Compaction Strategy logic could cause windows that had a very large number of sstables not to be compacted at all, increasing read amplification. #8147
