[PATCH 1/1] compaction: scrub/validate: prevent printing non-utf8 partition keys

1 view
Skip to first unread message

Benny Halevy

<bhalevy@scylladb.com>
unread,
Jul 29, 2021, 10:12:53 AM7/29/21
to scylladb-dev@googlegroups.com, Benny Halevy, Botond Dénes
Corrupt keys might be printed as non-utf8 strings to the log,
and that, in turn, may break applications reading the logs,
such as Python (3.7)

For example:
```
Traceback (most recent call last):
File "/home/bhalevy/dev/scylla-dtest/dtest.py", line 1148, in tearDown
self.cleanUpCluster()
File "/home/bhalevy/dev/scylla-dtest/dtest.py", line 1184, in cleanUpCluster
matches = node.grep_log(expr)
File "/home/bhalevy/dev/scylla-ccm/ccmlib/node.py", line 367, in grep_log
for line in f:
File "/usr/lib64/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 5577: invalid start byte
```

Signed-off-by: Benny Halevy <bha...@scylladb.com>
---
compaction/compaction.cc | 12 ++++++------
dht/i_partitioner.cc | 8 ++++++++
dht/i_partitioner.hh | 2 ++
3 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/compaction/compaction.cc b/compaction/compaction.cc
index 8af0ba2fa..0bd25a8d8 100644
--- a/compaction/compaction.cc
+++ b/compaction/compaction.cc
@@ -1199,9 +1199,9 @@ class scrub_compaction final : public regular_compaction {
type,
schema.ks_name(),
schema.cf_name(),
- new_key.key().with_schema(schema),
+ new_key.to_string(schema),
new_key,
- current_key.key().with_schema(schema),
+ current_key.to_string(schema),
current_key,
action.empty() ? "" : "; ",
action);
@@ -1214,9 +1214,9 @@ class scrub_compaction final : public regular_compaction {
type,
schema.ks_name(),
schema.cf_name(),
- new_key.key().with_schema(schema),
+ new_key.to_string(schema),
new_key,
- current_key.key().with_schema(schema),
+ current_key.to_string(schema),
current_key,
action.empty() ? "" : "; ",
action);
@@ -1234,7 +1234,7 @@ class scrub_compaction final : public regular_compaction {
mf.mutation_fragment_kind(),
mf.has_key() ? format(" with key {}", mf.key().with_schema(schema)) : "",
mf.position(),
- key.key().with_schema(schema),
+ key.to_string(schema),
key,
prev_pos.region(),
prev_pos.has_key() ? format(" with key {}", prev_pos.key().with_schema(schema)) : "",
@@ -1246,7 +1246,7 @@ class scrub_compaction final : public regular_compaction {
const auto& schema = validator.schema();
const auto& key = validator.previous_partition_key();
clogger.error("[{} compaction {}.{}] Invalid end-of-stream, last partition {} ({}) didn't end with a partition-end fragment{}{}",
- type, schema.ks_name(), schema.cf_name(), key.key().with_schema(schema), key, action.empty() ? "" : "; ", action);
+ type, schema.ks_name(), schema.cf_name(), key.to_string(schema), key, action.empty() ? "" : "; ", action);
}

private:
diff --git a/dht/i_partitioner.cc b/dht/i_partitioner.cc
index 410be9227..dff1dd2e7 100644
--- a/dht/i_partitioner.cc
+++ b/dht/i_partitioner.cc
@@ -33,6 +33,8 @@
#include <boost/range/adaptor/transformed.hpp>
#include "sstables/key.hh"
#include <seastar/core/thread.hh>
+#include <seastar/core/sstring.hh>
+#include "utils/utf8.hh"

namespace dht {

@@ -147,6 +149,12 @@ decorated_key::less_comparator::operator()(const decorated_key& lhs, const ring_
return lhs.tri_compare(*s, rhs) < 0;
}

+sstring
+decorated_key::to_string(const schema& s) const {
+ sstring ret = format("{}", key().with_schema(s));
+ return utils::utf8::validate((const uint8_t*)ret.data(), ret.size()) ? ret : "<non-utf8-key>";
+}
+
std::ostream& operator<<(std::ostream& out, const ring_position_ext& pos) {
return out << (ring_position_view)pos;
}
diff --git a/dht/i_partitioner.hh b/dht/i_partitioner.hh
index e64bbe3e4..bc88dadea 100644
--- a/dht/i_partitioner.hh
+++ b/dht/i_partitioner.hh
@@ -129,6 +129,8 @@ class decorated_key {
size_t memory_usage() const {
return sizeof(decorated_key) + external_memory_usage();
}
+
+ sstring to_string(const schema& s) const;
};


--
2.31.1

Botond Dénes

<bdenes@scylladb.com>
unread,
Jul 30, 2021, 1:21:59 AM7/30/21
to Benny Halevy, scylladb-dev@googlegroups.com
Isn't this too specific to scrub to be made a method of decorated_key?
I think it would be better to make it a static member of
scrub_compaction instead, as we are not expecting to see corrupt keys
anywhere else in the code. Additionally if you make it a decorated_key
member it will start competing with `with_schema()` which is
undesirable.

Benny Halevy

<bhalevy@scylladb.com>
unread,
Jul 30, 2021, 6:13:03 AM7/30/21
to Botond Dénes, scylladb-dev@googlegroups.com
Fair enough. Funny thing is that this is the way I implemented
it first, but it seemed generic enough to expose as a method of decorated_key.

At any rate, we can expose it there if needed in the future...

Benny Halevy

<bhalevy@scylladb.com>
unread,
Jul 30, 2021, 6:54:33 AM7/30/21
to scylladb-dev@googlegroups.com, Benny Halevy, Botond Dénes
Corrupt keys might be printed as non-utf8 strings to the log,
and that, in turn, may break applications reading the logs,
such as Python (3.7)

For example:
```
Traceback (most recent call last):
File "/home/bhalevy/dev/scylla-dtest/dtest.py", line 1148, in tearDown
self.cleanUpCluster()
File "/home/bhalevy/dev/scylla-dtest/dtest.py", line 1184, in cleanUpCluster
matches = node.grep_log(expr)
File "/home/bhalevy/dev/scylla-ccm/ccmlib/node.py", line 367, in grep_log
for line in f:
File "/usr/lib64/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 5577: invalid start byte
```

Test: unit(dev)
DTest: scrub_with_one_node_expect_data_loss_test

Signed-off-by: Benny Halevy <bha...@scylladb.com>
---
compaction/compaction.cc | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)

In v2:
- Define partition_key_to_string as static method in scrub_compaction

diff --git a/compaction/compaction.cc b/compaction/compaction.cc
index 542e97fb4..97e279f20 100644
--- a/compaction/compaction.cc
+++ b/compaction/compaction.cc
@@ -75,6 +75,7 @@
#include "mutation_source_metadata.hh"
#include "mutation_fragment_stream_validator.hh"
#include "utils/UUID_gen.hh"
+#include "utils/utf8.hh"

namespace sstables {

@@ -1202,9 +1203,9 @@ class scrub_compaction final : public regular_compaction {
type,
schema.ks_name(),
schema.cf_name(),
- new_key.key().with_schema(schema),
+ partition_key_to_string(new_key.key(), schema),
new_key,
- current_key.key().with_schema(schema),
+ partition_key_to_string(current_key.key(), schema),
current_key,
action.empty() ? "" : "; ",
action);
@@ -1217,9 +1218,9 @@ class scrub_compaction final : public regular_compaction {
type,
schema.ks_name(),
schema.cf_name(),
- new_key.key().with_schema(schema),
+ partition_key_to_string(new_key.key(), schema),
new_key,
- current_key.key().with_schema(schema),
+ partition_key_to_string(current_key.key(), schema),
current_key,
action.empty() ? "" : "; ",
action);
@@ -1237,7 +1238,7 @@ class scrub_compaction final : public regular_compaction {
mf.mutation_fragment_kind(),
mf.has_key() ? format(" with key {}", mf.key().with_schema(schema)) : "",
mf.position(),
- key.key().with_schema(schema),
+ partition_key_to_string(key.key(), schema),
key,
prev_pos.region(),
prev_pos.has_key() ? format(" with key {}", prev_pos.key().with_schema(schema)) : "",
@@ -1249,10 +1250,15 @@ class scrub_compaction final : public regular_compaction {
const auto& schema = validator.schema();
const auto& key = validator.previous_partition_key();
clogger.error("[{} compaction {}.{}] Invalid end-of-stream, last partition {} ({}) didn't end with a partition-end fragment{}{}",
- type, schema.ks_name(), schema.cf_name(), key.key().with_schema(schema), key, action.empty() ? "" : "; ", action);
+ type, schema.ks_name(), schema.cf_name(), partition_key_to_string(key.key(), schema), key, action.empty() ? "" : "; ", action);
}

private:
+ static sstring partition_key_to_string(const partition_key& key, const ::schema& s) {
+ sstring ret = format("{}", key.with_schema(s));
+ return utils::utf8::validate((const uint8_t*)ret.data(), ret.size()) ? ret : "<non-utf8-key>";
+ }
+
class reader : public flat_mutation_reader::impl {
using skip = bool_class<class skip_tag>;
private:
--
2.31.1

Botond Dénes

<bdenes@scylladb.com>
unread,
Jul 30, 2021, 7:26:48 AM7/30/21
to Benny Halevy, scylladb-dev@googlegroups.com
LGTM

Benny Halevy

<bhalevy@scylladb.com>
unread,
Aug 18, 2021, 9:10:02 AM8/18/21
to Maintainers, scylladb-dev@googlegroups.com
ping

(patch still applies to master @ 9c7bcd1d85a9ba5127d26a46d0e546e4b746fad4)

Benny Halevy

<bhalevy@scylladb.com>
unread,
Aug 26, 2021, 11:31:03 AM8/26/21
to Maintainers, scylladb-dev@googlegroups.com
ping^

Avi Kivity

<avi@scylladb.com>
unread,
Aug 26, 2021, 11:33:26 AM8/26/21
to Benny Halevy, scylladb-dev@googlegroups.com, Botond Dénes
Maybe we should return something? e.g. "\\xff\\xfe" for a string
containing the bytes 0xff 0xfe.

Benny Halevy

<bhalevy@scylladb.com>
unread,
Aug 26, 2021, 1:03:20 PM8/26/21
to Avi Kivity, scylladb-dev@googlegroups.com, Botond Dénes
It would be neat for a general purpose `to_string` method,
but in this use case, the string representation is always
immediately followed by a hex representation anyhow.

Benny Halevy

<bhalevy@scylladb.com>
unread,
Sep 8, 2021, 8:15:57 AM9/8/21
to Avi Kivity, scylladb-dev@googlegroups.com, Botond Dénes
ping

Benny Halevy

<bhalevy@scylladb.com>
unread,
Sep 12, 2021, 2:50:36 AM9/12/21
to Avi Kivity, scylladb-dev@googlegroups.com
ping

Avi, the patch is still needed
and still applies.

Please see my reply below to your suggestion.
I don't think that printing a hex representation
of non-utf8 keys is required in this case since
we already print the key bytes in hex right after
the string representation anyhow when reporting
scrub/validate errors.

Commit Bot

<bot@cloudius-systems.com>
unread,
Sep 12, 2021, 3:52:30 AM9/12/21
to scylladb-dev@googlegroups.com, Benny Halevy
From: Benny Halevy <bha...@scylladb.com>
Committer: Avi Kivity <a...@scylladb.com>
Branch: next

compaction: scrub/validate: prevent printing non-utf8 partition keys

Corrupt keys might be printed as non-utf8 strings to the log,
and that, in turn, may break applications reading the logs,
such as Python (3.7)

For example:
```
Traceback (most recent call last):
File "/home/bhalevy/dev/scylla-dtest/dtest.py", line 1148, in tearDown
self.cleanUpCluster()
File "/home/bhalevy/dev/scylla-dtest/dtest.py", line 1184, in cleanUpCluster
matches = node.grep_log(expr)
File "/home/bhalevy/dev/scylla-ccm/ccmlib/node.py", line 367, in grep_log
for line in f:
File "/usr/lib64/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 5577: invalid start byte
```

Test: unit(dev)
DTest: scrub_with_one_node_expect_data_loss_test

Signed-off-by: Benny Halevy <bha...@scylladb.com>
Message-Id: <20210730105428.2...@scylladb.com>

---
diff --git a/compaction/compaction.cc b/compaction/compaction.cc
--- a/compaction/compaction.cc
+++ b/compaction/compaction.cc
@@ -75,6 +75,7 @@
#include "mutation_source_metadata.hh"
#include "mutation_fragment_stream_validator.hh"
#include "utils/UUID_gen.hh"
+#include "utils/utf8.hh"

namespace sstables {

@@ -1213,9 +1214,9 @@ class scrub_compaction final : public regular_compaction {
type,
schema.ks_name(),
schema.cf_name(),
- new_key.key().with_schema(schema),
+ partition_key_to_string(new_key.key(), schema),
new_key,
- current_key.key().with_schema(schema),
+ partition_key_to_string(current_key.key(), schema),
current_key,
action.empty() ? "" : "; ",
action);
@@ -1228,9 +1229,9 @@ class scrub_compaction final : public regular_compaction {
type,
schema.ks_name(),
schema.cf_name(),
- new_key.key().with_schema(schema),
+ partition_key_to_string(new_key.key(), schema),
new_key,
- current_key.key().with_schema(schema),
+ partition_key_to_string(current_key.key(), schema),
current_key,
action.empty() ? "" : "; ",
action);
@@ -1248,7 +1249,7 @@ class scrub_compaction final : public regular_compaction {
mf.mutation_fragment_kind(),
mf.has_key() ? format(" with key {}", mf.key().with_schema(schema)) : "",
mf.position(),
- key.key().with_schema(schema),
+ partition_key_to_string(key.key(), schema),
key,
prev_pos.region(),
prev_pos.has_key() ? format(" with key {}", prev_pos.key().with_schema(schema)) : "",
@@ -1260,10 +1261,15 @@ class scrub_compaction final : public regular_compaction {

Commit Bot

<bot@cloudius-systems.com>
unread,
Sep 12, 2021, 4:04:57 PM9/12/21
to scylladb-dev@googlegroups.com, Benny Halevy
From: Benny Halevy <bha...@scylladb.com>
Committer: Avi Kivity <a...@scylladb.com>
Branch: master

compaction: scrub/validate: prevent printing non-utf8 partition keys

Corrupt keys might be printed as non-utf8 strings to the log,
and that, in turn, may break applications reading the logs,
such as Python (3.7)

For example:
```
Traceback (most recent call last):
File "/home/bhalevy/dev/scylla-dtest/dtest.py", line 1148, in tearDown
self.cleanUpCluster()
File "/home/bhalevy/dev/scylla-dtest/dtest.py", line 1184, in cleanUpCluster
matches = node.grep_log(expr)
File "/home/bhalevy/dev/scylla-ccm/ccmlib/node.py", line 367, in grep_log
for line in f:
File "/usr/lib64/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 5577: invalid start byte
```

Test: unit(dev)
DTest: scrub_with_one_node_expect_data_loss_test

Signed-off-by: Benny Halevy <bha...@scylladb.com>
Message-Id: <20210730105428.2...@scylladb.com>

---
diff --git a/compaction/compaction.cc b/compaction/compaction.cc
--- a/compaction/compaction.cc
+++ b/compaction/compaction.cc
@@ -75,6 +75,7 @@
#include "mutation_source_metadata.hh"
#include "mutation_fragment_stream_validator.hh"
#include "utils/UUID_gen.hh"
+#include "utils/utf8.hh"

namespace sstables {

@@ -1213,9 +1214,9 @@ class scrub_compaction final : public regular_compaction {
type,
schema.ks_name(),
schema.cf_name(),
- new_key.key().with_schema(schema),
+ partition_key_to_string(new_key.key(), schema),
new_key,
- current_key.key().with_schema(schema),
+ partition_key_to_string(current_key.key(), schema),
current_key,
action.empty() ? "" : "; ",
action);
@@ -1228,9 +1229,9 @@ class scrub_compaction final : public regular_compaction {
type,
schema.ks_name(),
schema.cf_name(),
- new_key.key().with_schema(schema),
+ partition_key_to_string(new_key.key(), schema),
new_key,
- current_key.key().with_schema(schema),
+ partition_key_to_string(current_key.key(), schema),
current_key,
action.empty() ? "" : "; ",
action);
@@ -1248,7 +1249,7 @@ class scrub_compaction final : public regular_compaction {
mf.mutation_fragment_kind(),
mf.has_key() ? format(" with key {}", mf.key().with_schema(schema)) : "",
mf.position(),
- key.key().with_schema(schema),
+ partition_key_to_string(key.key(), schema),
key,
prev_pos.region(),
prev_pos.has_key() ? format(" with key {}", prev_pos.key().with_schema(schema)) : "",
@@ -1260,10 +1261,15 @@ class scrub_compaction final : public regular_compaction {
Reply all
Reply to author
Forward
0 new messages