[RELEASE] Scylla 4.5 RC1

Tzach Livyatan

<tzach@scylladb.com>
Apr 29, 2021, 3:33:38 PM
to ScyllaDB users, scylladb-dev

The Scylla team is pleased to announce Scylla Open Source 4.5 RC1, the first Release Candidate for the Scylla Open Source 4.5 minor release. Moving forward, we will only fix critical bugs in branch-4.5. We will continue to fix bugs and add features to the master branch.

Scylla 4.5 includes a new restore (load and stream) operation, Alternator support for CORS, and many other performance and stability improvements and bug fixes (below). Find the Scylla Open Source 4.5 repository for your Linux distribution here. Scylla 4.5 RC1 Docker is also available.


Use the release candidate with caution; RC1 is not production-ready yet. You can help stabilize Scylla Open Source 4.5 by reporting bugs here.

Only the last two minor releases of the Scylla Open Source project are supported. Once Scylla Open Source 4.5 is officially released, only Scylla Open Source 4.5 and Scylla 4.4 will be supported, and Scylla 4.3 will be retired.


Related Links     

* Get Scylla Open Source 4.5 (for each distro)

* Report a problem

New features in Scylla 4.5

Load and stream

This feature extends nodetool refresh to allow loading arbitrary sstables that do not belong to a node into the cluster. It loads the sstables from disk, calculates which nodes own the data, and streams the data to them automatically.

For example, say the old cluster has 6 nodes and the new cluster has 3 nodes. We can copy the sstables from the old cluster to any of the new nodes and trigger the load and stream process.

This can make restores and migrations much easier:

  • You can place sstables from any node on any node

  • No need to run nodetool cleanup to remove unused data

curl -X POST "http://{ip}:10000/storage_service/sstables/{keyspace}?cf={table}&load_and_stream=true"
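
The same request can be issued from a script. The following is a minimal Python sketch, assuming the REST API listens on port 10000 as in the curl example. The helper names load_and_stream_url and trigger_load_and_stream, and the sample host, keyspace, and table, are made up for illustration.

```python
import urllib.parse
import urllib.request


def load_and_stream_url(host, keyspace, table, port=10000):
    """Build the load-and-stream URL used by the curl example."""
    query = urllib.parse.urlencode({"cf": table, "load_and_stream": "true"})
    path = "/storage_service/sstables/" + urllib.parse.quote(keyspace, safe="")
    return f"http://{host}:{port}{path}?{query}"


def trigger_load_and_stream(host, keyspace, table):
    """POST the request (hypothetical helper); the sstables must already be on the node."""
    request = urllib.request.Request(
        load_and_stream_url(host, keyspace, table), method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical host, keyspace, and table names; this only prints the URL.
print(load_and_stream_url("127.0.0.1", "ks", "t"))
```

Calling trigger_load_and_stream against a live node sends the POST; the print above only shows the URL that would be used.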

Alternator

  • Support for Cross-Origin Resource Sharing (CORS). This allows client browsers to access the database directly via JavaScript, avoiding the middle tier. #8025

  • Support limiting the number of concurrent requests with a scylla.yaml configuration value, max_concurrent_requests_per_shard. If the limit is exceeded, Alternator returns a RequestLimitExceeded error (compatible with the DynamoDB API). #7294

  • Alternator, Scylla's implementation of a DynamoDB-compatible API, now fully supports nested attribute paths. Nested attribute processing happens when an item's attribute is itself an object, and an operation modifies just one of the object's attributes instead of the entire object. #5024 #8043

  • Alternator now supports slow query logging capability. Queries that last longer than the specified threshold are logged in `system_traces.node_slow_log` and traced. #8292

  • Alternator was changed to avoid large contiguous allocations for large requests. Instead, the allocation will be broken up into smaller chunks. This reduces stress on the allocator, and therefore latency. #7213

  • sstableloader now works with Alternator tables #8229

  • Support attribute paths in ConditionExpression, FilterExpression

  • Support attribute paths in ProjectionExpression
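
To illustrate what a nested attribute path does, here is a small Python sketch. It builds a DynamoDB-style UpdateItem request body that sets a single nested field, and models the effect locally on a plain dictionary. The table name users, the key attribute id, and the attribute names are hypothetical; sending the request to Alternator is left out.

```python
import json


def nested_update_request(table, key, path, value):
    """Build a DynamoDB-style UpdateItem body that sets one nested attribute."""
    return {
        "TableName": table,
        "Key": {"id": {"S": key}},  # assumes a string key attribute named "id"
        "UpdateExpression": f"SET {path} = :v",
        "ExpressionAttributeValues": {":v": {"S": value}},
    }


def apply_nested_set(item, path, value):
    """Local model of the effect: change one field, keep the rest of the object."""
    *parents, leaf = path.split(".")
    target = item
    for name in parents:
        target = target[name]
    target[leaf] = value
    return item


item = {"id": "u1", "address": {"city": "Oslo", "zip": "0150"}}
print(json.dumps(nested_update_request("users", "u1", "address.city", "Bergen"), indent=2))
print(apply_nested_set(item, "address.city", "Bergen"))
```

Only address.city changes; address.zip is left untouched, which is the difference from overwriting the whole address object.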

Raft

We are building up Raft as an internal service in Scylla, useful for several applications. The changes have no visible effect yet. Among others, the following were added:

  • Scylla stores the database schema in a set of tables. Previously, these tables were sharded across all cores like ordinary user tables. They are now maintained by shard 0 alone. This is a step towards letting Raft manage them, since Raft needs to atomically modify the schema tables, and this can't be done if the data is distributed on many cores. #7947

  • Raft can now store its log data in a system table. Raft is implemented in a modular fashion with plug-ins implementing various parts; this is a persistence module.

  • Raft Joint Consensus has been merged. This is the ability to change a Raft group from one set of nodes to another, needed to change cluster topology or to migrate data to different nodes.

  • Raft now integrates with the Scylla RPC subsystem; Raft itself is modular and requires integration with the various Scylla service providers.

  • The Raft implementation gained support for non-voting nodes. This is used to make membership changes less disruptive.

  • The Raft implementation now has a per-server timer, used for Raft keepalives

  • The Raft implementation gained support for leader step down. This improves availability when a node is taken down for planned maintenance
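
As a rough illustration of joint consensus and non-voting nodes, the sketch below shows the general rule from the Raft design: while a group is moving from an old set of voters to a new one, an entry needs a majority in both sets, and acknowledgements from non-voting nodes do not count. This is a toy model in Python with made-up node names, not Scylla's implementation.

```python
def has_majority(acks, voters):
    """True if more than half of `voters` acknowledged."""
    return len(acks & voters) > len(voters) // 2


def joint_quorum(acks, old_voters, new_voters):
    """During a joint configuration, an entry needs a majority of both voter sets."""
    return has_majority(acks, old_voters) and has_majority(acks, new_voters)


old = {"a", "b", "c"}
new = {"c", "d", "e"}

# "learner" is a non-voting node: it is in neither voter set, so its ack is ignored.
print(joint_quorum({"a", "b", "learner"}, old, new))
print(joint_quorum({"a", "c", "d"}, old, new))
```

The first call has a majority of the old set only, so it fails the joint rule; the second has a majority of both sets.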

Deployment and packaging

  • It is now possible to install Scylla on SUSE Linux Enterprise Server using the standard RPM packages. #8277

  • The setup utility now uses chrony instead of ntp for timekeeping on all Linux distributions. This makes setup more regular. #7922 

  • aio-max-nr is now set dynamically based on the number of CPUs. This is mostly needed for large machines like EC2 i3en.24xlarge #8133

CDC

  • The Change Data Capture (CDC) facility used a collection to store information about CDC log streams. Since large clusters can have many streams, this violation of Scylla guidelines caused many latency problems. Two steps were taken to correct it: the number of streams was limited (with some loss in efficiency on large clusters), and a new format was introduced (with automatic transition code) that uses partitions and clustering rows instead of collections. #7993

Tools and APIs

  • It is now possible to perform a partial repair when a node is missing, by using the new ignore_nodes option. Repair will also detect when a repair range has no live nodes to repair with and short-circuit the operation #7806 #8256

  • The Thrift API is now disabled by default. As it is less often used, users might not be aware that Thrift is open, which might be a security risk. #8336. To enable it, add “start_rpc: true” to scylla.yaml. In addition, Thrift now have

  • Nodetool Top Partitions extension. nodetool toppartitions allows you to find the partitions with the highest read and write access in the last time window. Until now, nodetool toppartitions supported only one table at a time. From Scylla 4.5, nodetool toppartitions allows specifying a list of tables or keyspaces. #4520

  • nodetool stop now supports additional compaction types. The compaction types are: COMPACTION, CLEANUP, VALIDATION, SCRUB, INDEX_BUILD, RESHARD, UPGRADE, RESHAPE. For example: nodetool stop SCRUB. Note that reshard and reshape start automatically on boot or refresh, if needed. Compaction, cleanup, scrub, and upgrade are started with a nodetool command. The others (VALIDATION, INDEX_BUILD) are unsupported by nodetool stop.

  • scylla_setup option to retry the RAID setup #8174

  • New system/drop_sstable_caches RESTful API. Evicts objects from caches that reflect sstable content, like the row cache. In the future, it will also drop the page cache and sstable index caches. While the existing BYPASS CACHE affects the behavior of a given CQL query on a per-query basis, this API clears the cache at the time of invocation; later queries will populate it again.

  • REST API: add the compaction id to the response of GET compaction_manager/compactions

  • Lightweight (fast) slow query logging mode. This is a new, low-overhead mode for slow query tracing. When enabled, it works the same way as slow query tracing, except that it omits recording the tracing events. It does not populate the system_traces.events table, but it still populates trace session records for slow queries in the other tables: system_traces.sessions, system_traces.node_slow_log, etc. #2572. More here
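
For scripts that wrap nodetool stop, a small guard can catch bad type names before the command runs. The sketch below is a hypothetical Python helper, not part of nodetool. It takes the type list from the nodetool stop item and, following the note there, rejects VALIDATION and INDEX_BUILD.

```python
COMPACTION_TYPES = [
    "COMPACTION", "CLEANUP", "VALIDATION", "SCRUB",
    "INDEX_BUILD", "RESHARD", "UPGRADE", "RESHAPE",
]
# Per the release note, these two are listed but not handled by nodetool stop.
UNSUPPORTED = {"VALIDATION", "INDEX_BUILD"}


def nodetool_stop_command(compaction_type):
    """Return the argv for `nodetool stop <TYPE>`, rejecting unknown or unsupported types."""
    name = compaction_type.upper()
    if name not in COMPACTION_TYPES:
        raise ValueError(f"unknown compaction type: {compaction_type}")
    if name in UNSUPPORTED:
        raise ValueError(f"{name} is not supported by nodetool stop")
    return ["nodetool", "stop", name]


print(" ".join(nodetool_stop_command("scrub")))
```

The returned list can be passed to subprocess.run on a node where nodetool is installed.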

Performance Optimizations

  • Improve flat_mutation_reader::consume_pausable #8359. Combined reader microbenchmark has shown from 2% to 22% improvement in median execution time while memtable microbenchmark has shown from 3.6% to 7.8% improvement in median execution time.

  • Fix significant write amplification when reshaping level 0 in an LCS table #8345

  • The Log-Structured Allocator (LSA) is the underlying memory allocator behind Scylla's cache and memtables. When memory runs out, it is called to evict objects from cache, and to defragment free memory, in order to serve new allocation requests. If memory was especially fragmented, or if the allocation request was large, this could take a long while, causing a latency spike. To combat this, a new background reclaim service is added which evicts and defragments memory ahead of time and maintains a watermark of free, non-fragmented memory from which allocations can be satisfied quickly. This is somewhat similar to kswapd on Linux. #1634

  • To store cells in rows, Scylla used a combination of a vector (for short rows) and red-black tree (for wide rows), switching between the representations dynamically. The red-black tree is inefficient in memory footprint when many cells are present, so the data storage now uses a radix tree exclusively. This reduces the memory footprint and improves efficiency.

  • sstables: Share partition index pages between readers. Before this patch, each index reader had its own cache of partition index pages. Now there is a shared cache, owned by the sstable object. This allows concurrent reads to share partition index pages and thus reduce the amount of I/O. For I/O-bound workloads, a read needed 2 I/Os before and needs 1 (amortized) now. The throughput is ~70% higher. More

  • Switch partition rows onto B-tree. The data type for storing rows inside a partition was changed from a red-black tree to a B-tree. This saves space and spares some cpu cycles. More here.

  • The sstable reader will now allow preemption at row granularity; previously, sstables containing many small rows could cause small latency spikes as the reader would only preempt when an 8k buffer was filled.  #7883
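
The effect of sharing partition index pages between readers can be shown with a toy model. The Python sketch below is not Scylla code, and the class names are invented. It counts how many index-page reads hit the disk when several readers ask for the same page, with a private cache per reader versus a cache owned by the sstable object.

```python
class SSTable:
    """Toy sstable that counts index-page reads from disk."""

    def __init__(self):
        self.io_count = 0
        self.shared_pages = {}  # page number -> page contents, shared by all readers

    def read_page_from_disk(self, page_no):
        self.io_count += 1
        return f"index-page-{page_no}"


class IndexReader:
    def __init__(self, sstable, shared):
        self.sstable = sstable
        # Old behaviour: a private cache per reader. New behaviour: the sstable's cache.
        self.cache = sstable.shared_pages if shared else {}

    def get_page(self, page_no):
        if page_no not in self.cache:
            self.cache[page_no] = self.sstable.read_page_from_disk(page_no)
        return self.cache[page_no]


def index_io(shared, readers=4):
    """Number of disk reads when `readers` readers each fetch the same index page."""
    sstable = SSTable()
    for _ in range(readers):
        IndexReader(sstable, shared).get_page(7)
    return sstable.io_count


print(index_io(shared=False), index_io(shared=True))
```

With private caches every reader pays for its own disk read; with the shared cache only the first reader does.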

Repair Based Node Operations (experimental)

Repair Based Node Operations was introduced as an experimental feature in Scylla 4.0, with the intent of using the same underlying implementation for streaming and repair. While it is still considered experimental, we continue to work on this feature.

Repair is oriented towards moving small amounts of data, not an entire node's worth. This resulted in many sstables being created on the node, creating a large compaction load. To fix that, offstrategy compaction is now used to compact these sstables efficiently without impacting the main workload. #5226

To enable repair based node operations, add the following to scylla.yaml:

enable_repair_based_node_ops: true

Configuration

Other bugs fixed in this release

  • Stability: Optimized TWCS single-partition reader opens sstables unnecessarily #8432 

  • Stability: TimeWindowCompactionStrategy not using specialized reader for single partition queries #8415

  • Stability: Scylla will exit when a DC with zero replication is accessed with LOCAL_QUORUM (one can define a different replication factor per DC). #8354

  • Tools: sstableloader: partition with old deletion and new data handled incorrectly #8390

  • Stability: Commitlog pre-fill inner loop condition broken #8369

  • aws: aws_instance.ebs_disks() causes traceback when no EBS disks #8365

  • thrift: handle gate closed exception on retry #8337

  • Stability: missing dead row marker for KA/LA file format #8324. Note that the KA/LA sstable formats are legacy formats that are not used in latest Scylla versions.

  • inactive readers unification caused lsa OOM in toppartitions_test #8258

  • Thrift: too many accept attempts end up in segmentation fault #8317

  • Stability: Failed SELECT with tuple of reversed-ordered frozen collections #7902

  • Stability: Certain combination of filtering, index, and frozen collection, causes "marshalling error" failure #7888

  • Build: tools/toolchain: install-dependencies.sh causes an error while building the Docker image, and the error is ignored #8293

  • Stability: Use-after-free in simple_repair_test #8274

  • Monitoring: storage_proxy counters are not updated on cql counter operations #4337

  • Security: Enforce dc/rack membership iff required for non-tls connections #8051

  • Stability: Scylla tries to keep enough free memory ahead of allocation, so that allocations don't stall. The amount of CPU power devoted to background reclaim is supposed to self-tune with memory demand, but this wasn't working correctly. #8234

  • Nodetool cleanup failed because of "DC or rack not found in snitch properties" #7930

  • Stability: a possible race condition in MV/SI schema creation and load may cause inconsistency between base table and view table #7709

  • Thrift: Regression in thrift_tests.test_get_range_slice dtest: query_data_on_all_shards(): reverse range scans are not supported #8211

  • Stability: mutation_test: fatal error: in "test_apply_monotonically_is_monotonic": Mutations differ #8154

  • Stability: Node was overloaded: Too many in flight hints during Enospc nemesis #8137

  • Stability: Make untyped_result_set non-copying and retain fragments #8014

  • Stability: Requests are not entirely read during shedding, which leads to invalidating the connection once shedding happens. Shedding is the process of dropping requests to protect the system, for example, if they are too large or exceeding the max number of concurrent requests per shard. #8193

  • Stability: Versioned sstable_set #2622

  • UX: Improve the verbosity of errors coming from the view builder/updater #8177

  • Tools: Incorrect output in nodetool compactionstats #7927

  • Stability: cache-bypassing single-partition query from TWCS table not showing a row (but it appears in range scans). Introduced after Scylla 4.4 #8138

  • CQL: unpaged query is terminated silently if it reaches global limit first. The bug was introduced in Scylla 4.3  #8162

  • Stability: The multishard combining reader is responsible for merging data from multiple cores when a range scan runs. A bug that is triggered by very small token ranges (e.g. 1 token) caused shards that have no data to contribute to be queried, increasing read amplification. #8161

  • Stability: Repairing a table with TWCS can cause a high number of parallel compactions #8124

  • Stability: Run init_server and join_cluster inside maintenance scheduling group #8130

  • Install: scylla_create_devices fails on EC2  with subprocess.CalledProcessError: Command /opt/scylladb/scripts/scylla_raid_setup... returned non-zero exit status 1 #8055

  • Stability: cdc: log: use-after-free in process_bytes_visitor #8117

  • Stability: Repair task from manager failed due to a coredump on one of the nodes #8059

  • CQL: NetworkTopologyStrategy data center options are not validated #7595

  • Stability: no local limit for non-limited queries in mixed cluster may cause repair to fail #8022

  • Debug: Make scylla backtraces always print in one line #5464

  • Init: perftune.py fails with TypeError: 'NoneType' object is not iterable #8008

  • Stability: using experimental UDF can lead to exit #7977

  • Stability: Make commitlog accept N mutations in bulk #7615

  • Stability: transport: Fix abort on certain configurations of native_transport_port(_ssl) #7866 #7783

  • Debug: add sstable origin information to scylla metadata component #7880

  • Install: dist/offline_installer/redhat: causes "scylla does not work with current umask setting (0077)" #6243

  • Alternator: nodetool cannot work on table with a dot in its name #6521

  • Stability: During replace node operation - replacing node is used to respond to read queries #7312

  • Install: Scylla doesn't use /etc/security/limits.d/scylla.conf #7925

  • Stability: multishard_combining_reader uses smp::count in one place instead of _sharder.shard_count() #7945

  • Stability: Failed fromJson() should result in FunctionFailure error, not an internal error #7911

  • Stability: List append uses the wrong timestamp with LWT #7611

  • Stability: currentTimeUUID creates duplicates when called at the same point in time #6208

  • Build: dbuild fails with an error on older kernels (without cgroupsv2) #7938

  • Stability: Error: "seastar - Exceptional future ignored: sstables::compaction_stop_exception" after node drain #7904

  • UX: Scylla reports broken pipe and connection reset by peer errors from the native transport, although they can happen in normal operation. #7907

  • Redis: redis 'exists' command fails with lots of keys #7273

  • UX: Make scylla backtraces always print in one line #5464

  • Stability: A mistake in Time Window Compaction Strategy logic could cause windows that had a very large number of sstables not to be compacted at all, increasing read amplification. #8147
