scylla 3.1, records disappear in big table


Michael <micha-1@fantasymail.de>
Jan 15, 2020, 5:39:03 AM
to scylladb-users@googlegroups.com
Hello,


we have the following issue and are unable to find a fix:

During the import of data, records that were inserted (and could be
found) suddenly go missing, but after some time they are there again
(although not all of them).

We assume (and may be wrong about that) that this issue started to show
up after upgrading from Scylla 3.0 to 3.1.

In more detail:

One of our tables has around 6 billion (6,000,000,000) records. A record
consists of a few short blobs (< 60 bytes), some numbers, a text
field, a timestamp and a map<text, frozen<set<text>>>, which is null at
the moment.

We have importer software (Java) which inserts the data. The cluster
uses HDDs, so insert rates are a few tens of thousands of records per second.

We checked the software many times, inserted countless logging
statements, and didn't get any errors during the last import.
But one of us regularly uses some of the data for software checks, and he
noticed that data which had been found in a previous check was now
missing.

So we repeated the cycle of debugging, importing and checking, but we are
not able to get rid of this issue.
Today another record was reported to be missing, so I checked it in
cqlsh and it was indeed missing. Half an hour later I issued the same
select, but this time the record was found (!).

We dropped the table and reimported the data. All seemed fine, but after
some time records went missing again.

It seems that this weird behaviour does not show up until the table
contains around 3 billion records. With less data, no missing records
were noticed.

At the moment I have no idea what's going on. Is it possible that,
during a big import, ongoing compactions have an (unwanted) impact
on selecting data, so that after the compaction the data is found again?

Is the dependency on SSD disks really so strong that such an issue can arise?


So in short: when inserting more than around 3 billion records, records
which were found in the table disappear later; some reappear after some
time, many do not.

The cluster consists of 8 nodes, each with 64 GB RAM and two 3.5 TB HDDs
in RAID 0; the replication factor is 3, and insert and select consistency is TWO.


I would really appreciate some hints on this.

Thanks for helping,
Michael

Dor Laor <dor@scylladb.com>
Jan 15, 2020, 11:27:31 AM
to ScyllaDB users
On Wed, Jan 15, 2020 at 2:39 AM Michael <mic...@fantasymail.de> wrote:
>
> Hello,
>
>
> we have the following issue and are unable to find a fix:
>
> During the import of data, records that were inserted (and could be
> found) suddenly go missing, but after some time they are there again
> (although not all of them).
>
> We assume (and may be wrong about that) that this issue started to show
> up after upgrading from Scylla 3.0 to 3.1

Can you verify that - test the same process with 3.0 vs 3.1?
It would be good to provide the exact versions too.

>
> In more detail:
>
> One of our tables has around 6 billion (6,000,000,000) records. A record
> consists of a few short blobs (< 60 bytes), some numbers, a text
> field, a timestamp and a map<text, frozen<set<text>>>, which is null at
> the moment.
>
> We have importer software (Java) which inserts the data. The cluster
> uses HDDs, so insert rates are a few tens of thousands of records per second.
>
> We checked the software many times, inserted countless logging
> statements, and didn't get any errors during the last import.
> But one of us regularly uses some of the data for software checks, and he
> noticed that data which had been found in a previous check was now
> missing.
>
> So we repeated the cycle of debugging, importing and checking, but we are
> not able to get rid of this issue.
> Today another record was reported to be missing, so I checked it in
> cqlsh and it was indeed missing. Half an hour later I issued the same
> select, but this time the record was found (!).

Were any deletes or updates applied for this key?
Do you use client timestamps?
Do all of the servers have synchronized time (NTP)?
Was a repair running in the middle?


>
> We dropped the table and reimported the data. All seemed fine, but after
> some time records went missing again.
>
> It seems that this weird behaviour does not show up until the table
> contains around 3 billion records. With less data, no missing records
> were noticed.
>
> At the moment I have no idea what's going on. Is it possible that,
> during a big import, ongoing compactions have an (unwanted) impact
> on selecting data, so that after the compaction the data is found again?
>
> Is the dependency on SSD disks really so strong that such an issue can arise?

I don't have a good suggestion, but it's definitely not related to SSD
vs. HDD, nor to the amount of records. It shouldn't be related to
compaction either.

>
>
> So in short: when inserting more than around 3 billion records, records
> which were found in the table disappear later; some reappear after some
> time, many do not.
>
> The cluster consists of 8 nodes, each with 64 GB RAM and two 3.5 TB HDDs
> in RAID 0; the replication factor is 3, and insert and select consistency is TWO.
>
>
> I would really appreciate some hints on this.
>
> Thanks for helping,
> Michael
>

Michael <micha-1@fantasymail.de>
Jan 16, 2020, 4:48:49 AM
to scylladb-users@googlegroups.com, Dor Laor


On 15.01.20 at 17:26, Dor Laor wrote:
>
> Can you verify that - test the same process with 3.0 vs 3.1?
> It would be good to provide the exact versions too.

Hmm, not that fast: downgrading is not an option (is it even possible?),
and it depends on the hardware.

scylla release version is 3.1.1-0.20191024.f32ec885c

>
> Were any deletes or updates applied for this key?
> Do you use client timestamps?
> Do all of the servers have synchronized time (NTP)?
> Was a repair running in the middle?

No deletes or updates, no repairs, no special timestamps.
The servers are synced via NTP.


Michael


Avi Kivity <avi@scylladb.com>
Jan 16, 2020, 4:53:54 AM
to scylladb-users@googlegroups.com, Michael
What consistency level are you using? Both for writes and reads.


If either reads or writes use a consistency level of ONE or LOCAL_ONE,
then it's possible for what you describe to happen.
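The underlying rule: a read is only guaranteed to observe a completed
write if the two consistency levels overlap on at least one replica,
i.e. R + W > RF. A minimal cqlsh sketch (keyspace, table and key are
placeholders):

```
-- RF = 3: writes at TWO and reads at TWO overlap (2 + 2 > 3),
-- so the read is guaranteed to see the completed write.
CONSISTENCY TWO;
INSERT INTO ks.tab (key, id) VALUES (0x01, 0x01);
SELECT * FROM ks.tab WHERE key = 0x01;

-- Reads at ONE do not overlap with writes at TWO (1 + 2 = 3, not > 3),
-- so this read may hit a replica the write has not reached yet.
CONSISTENCY ONE;
SELECT * FROM ks.tab WHERE key = 0x01;
```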


Michael <micha-1@fantasymail.de>
Jan 16, 2020, 5:25:17 AM
to Avi Kivity, scylladb-users@googlegroups.com

On 16.01.20 at 10:53, Avi Kivity wrote:
>
> On 15/01/2020 12.39, Michael wrote:
>>
>> The cluster consists of 8 nodes, each with 64 GB RAM and two 3.5 TB HDDs
>> in RAID 0; the replication factor is 3, and insert and select consistency is TWO.
>>
>
> What consistency level are you using? Both for writes and reads.
>
>
> If either reads or writes use a consistency level of ONE or LOCAL_ONE,
> then it's possible for what you describe to happen.
>
no, it's two and quorum (which means also two since consistency is three)


 Michael



Michael <micha-1@fantasymail.de>
Jan 16, 2020, 5:32:00 AM
to scylladb-users@googlegroups.com, Avi Kivity
Corrected the last mail...



On 16.01.20 at 10:53, Avi Kivity wrote:

> On 15/01/2020 12.39, Michael wrote:
>> The cluster consists of 8 nodes, each with 64 GB RAM and two 3.5 TB HDDs
>> in RAID 0; the replication factor is 3, and insert and select consistency is TWO.
>>
>
> What consistency level are you using? Both for writes and reads.
>
>
> If either reads or writes use a consistency level of ONE or LOCAL_ONE,
> then it's possible for what you describe to happen.
>
no, it's two and quorum (which means also two since replication is three)


 Michael

Avi Kivity <avi@scylladb.com>
Jan 16, 2020, 6:15:19 AM
to micha-1@fantasymail.de, scylladb-users@googlegroups.com
Yes, you wrote that in your initial message, sorry.


The dependency on SSD is only for performance, not correctness. This
should have worked without issue.


Can you run the queries with CL=QUORUM and TRACING ON from cqlsh, and
provide examples of both successful and failing reads?
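In cqlsh that would be something like (table and key are placeholders):

```
CONSISTENCY QUORUM;
TRACING ON;
SELECT * FROM ks.tab WHERE key = 0x0123;
```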


Michael <micha-1@fantasymail.de>
Jan 16, 2020, 8:04:12 AM
to scylladb-users@googlegroups.com, Avi Kivity

On 16.01.20 at 12:15, Avi Kivity wrote:
>
>
> The dependency on SSD is only for performance, not correctness. This
> should have worked without issue.
>
>
> Can you run the queries with CL=QUORUM and TRACING ON from cqlsh, and
> provide examples of both successful and failing reads?
>
>

Hm, that doesn't work. I think we hit
https://github.com/scylladb/scylla/issues/4601

Maybe the upgrade was not as smooth as hoped. To be safe, maybe we
should set up a new 3.1.1 database and try again.

Not sure.


 Michael



Michael <micha-1@fantasymail.de>
Jan 17, 2020, 5:15:54 AM
to scylladb-users@googlegroups.com, Avi Kivity
Hi again,


I started a repair yesterday and it's getting worse.

One of us has a test set of 25,000 records (out of 7,000,000,000) and
there were 3 missing.

Today he mentioned that now around 1,500 records are missing which were
available yesterday.

The single record I checked yesterday, which was missing at first and
then appeared (the select in cqlsh found it), is now missing again.


The table uses SizeTieredCompactionStrategy.

Disk usage is at 70% (5 TB of 7.3 TB).


 Michael




Dor Laor <dor@scylladb.com>
Jan 17, 2020, 11:49:22 AM
to ScyllaDB users, Avi Kivity
On Fri, Jan 17, 2020 at 2:15 AM Michael <mic...@fantasymail.de> wrote:
>
> Hi again,
>
>
> I started a repair yesterday and it's getting worse.
>
> One of us has a test set of 25,000 records (out of 7,000,000,000) and
> there were 3 missing.
>
> Today he mentioned that now around 1,500 records are missing which were
> available yesterday.
>
> The single record I checked yesterday, which was missing at first and
> then appeared (the select in cqlsh found it), is now missing again.

Can you check its existence on each of the replica nodes, separately?
Find the right replicas, issue a query with CL=1 for this key, and let's
see what each node reports. Later, use your regular query (CL=2?) and
let's see what happens.
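A sketch of the procedure (keyspace, table and key encoding are
placeholders; getendpoints maps a partition key to its replicas):

```
# prints the replica IPs for this key
$ nodetool getendpoints ks tab 0123
```

and then, in cqlsh against each of those replicas:

```
CONSISTENCY ONE;
SELECT * FROM ks.tab WHERE key = 0x0123;
```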

Do you regularly run repair?
Note that you need to leave 50% free space with STCS.
Was there anything significant in the logs? The only thing I can think
of is compaction, but it's very safe. I guess you would have reported an
ENOSPC, so that's not the case.
What does your schema look like?

>
>
> The table uses SizeTieredCompactionStrategy.
>
> Disk usage is at 70% (5 TB of 7.3 TB).
>
>
> Michael
>
>
>
>

Michael <micha-1@fantasymail.de>
Jan 20, 2020, 5:00:13 AM
to scylladb-users@googlegroups.com, Dor Laor, Avi Kivity


On 17.01.20 at 17:48, Dor Laor wrote:
> On Fri, Jan 17, 2020 at 2:15 AM Michael <mic...@fantasymail.de> wrote:
>>
> Can you check its existence on each of the replica nodes, separately?
> Find the right replicas, issue a query with CL=1 for this key, and let's
> see what each node reports. Later, use your regular query (CL=2?) and
> let's see what happens.

Yes, I will do that.

> Do you regularly run repair?

After the import I ran a repair.

> Note that you need to leave 50% free space with STCS.


OK - that's 30% free space at the moment. But I think errors due to
missing HDD space would appear in the logs?

> Was there anything significant in the logs? The only thing I can think
> of is compaction, but it's very safe. I guess you would have reported an
> ENOSPC, so that's not the case.

No, it's not the case.


> What does your schema look like?



Schema:

create table errorness_table (
    key blob,
    id blob,
    attr map<text, frozen<set<text>>>,
    bin1 blob,
    bin2 blob,
    bin3 blob,
    bin4 blob,
    ts timestamp,
    txt1 text,
    txt2 text,
    val1 bigint,
    val2 bigint,
    primary key (key, id)
) with compaction = { 'class': 'SizeTieredCompactionStrategy' }
and compression = { 'chunk_length_in_kb': '16',
    'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor' };


The blobs are all less than 40 bytes in size, id is 4 bytes. The text
fields are also small, attr is null.


Michael

Botond Dénes <bdenes@scylladb.com>
Jan 20, 2020, 6:42:23 AM
to scylladb-users@googlegroups.com, Dor Laor, Avi Kivity
Are the missing records rows or entire partitions?
Can you show an example of a query you use when records are not showing
up?

>
>
> Michael
>


Michael <micha-1@fantasymail.de>
Jan 21, 2020, 2:56:29 AM
to scylladb-users@googlegroups.com, Botond Dénes, Dor Laor, Avi Kivity


On 20.01.20 at 12:42, Botond Dénes wrote:
>
>
> Are the missing records rows or entire partitions?
> Can you show an example of a query you use when records are not showing
> up?
>

Single rows are missing, independent of the time of the insert (the
import takes two days).

In cqlsh it's a normal select * from table where key = ...

In Java we used async queries; later we switched to synced queries
and handled the async stuff via Java threads (just to see if the same
issue happens).

Michael

Botond Dénes <bdenes@scylladb.com>
Jan 21, 2020, 8:02:19 AM
to micha-1@fantasymail.de, scylladb-users@googlegroups.com, Dor Laor, Avi Kivity
On Tue, 2020-01-21 at 08:56 +0100, Michael wrote:
>
> On 20.01.20 at 12:42, Botond Dénes wrote:
> >
> > Are the missing records rows or entire partitions?
> > Can you show an example of a query you use when records are not
> > showing up?
> >
>
> Single rows are missing, independent of the time of the insert (the
> import takes two days).
>
> In cqlsh it's a normal select * from table where key = ...


Can you please do what Avi suggested, i.e. find a partition that has
missing records, then execute the query against each node in turn with
CL=LOCAL_ONE?

Michael <micha-1@fantasymail.de>
Jan 21, 2020, 10:25:19 AM
to scylladb-users@googlegroups.com, Botond Dénes, Dor Laor, Avi Kivity

On 21.01.20 at 14:02, Botond Dénes wrote:
>
> Can you please do what Avi suggested, i.e. find a partition that has
> missing records, then execute the query against each node in turn with
> CL=LOCAL_ONE?

Yes, of course:


I tried with two keys which are missing; one of them is the one which
was missing, then appeared, then disappeared.

I issued nodetool getendpoints and got three nodes back.

On each of those nodes, with consistency LOCAL_ONE, I executed a select
with that key.

The selects all returned an empty result.


Thanks

 Michael





Botond Dénes <bdenes@scylladb.com>
Jan 21, 2020, 10:35:33 AM
to micha-1@fantasymail.de, scylladb-users@googlegroups.com, Dor Laor, Avi Kivity
I'm somewhat confused, due to the ambiguous term `key`, which can be a
partition key, a clustering key, or the `key` column in your schema
(the partition key). In an earlier email you mentioned that the missing
records are rows. So the query you executed looked something like
this:

select * from errorness_table where key = ? and id = ?;


One more question: are `id`s unique across partitions, i.e. can the
same `id` appear in more than one partition?

>
>
> Thanks
>
> Michael
>
>
>
>
>


Michael <micha-1@fantasymail.de>
Jan 21, 2020, 10:47:51 AM
to Botond Dénes, scylladb-users@googlegroups.com, Dor Laor, Avi Kivity

On 21.01.20 at 16:35, Botond Dénes wrote:
>
> I'm somewhat confused, due to the ambiguous term `key`, which can be a
> partition key, a clustering key, or the `key` column in your schema
> (the partition key). In an earlier email you mentioned that the missing
> records are rows. So the query you executed looked something like
> this:

Yes, I should be more exact here.

The column 'key' is the partition key.

The column 'id' is the clustering key.

> select * from errorness_table where key = ? and id = ?;


It was:

select * from errorness_table where key = ?;


> One more question: are `id`s unique across partitions, i.e. can the
> same `id` appear in more than one partition?
>

Our partition key is mostly unique, i.e. out of 7 billion rows there are
around 5 collisions, i.e. pairs with the same partition key.

The clustering key 'id' is only 4 bytes long, so it appears more often.

The combination of partition key and clustering key is unique.

 Michael



Botond Dénes <bdenes@scylladb.com>
Jan 21, 2020, 11:16:42 AM
to micha-1@fantasymail.de, scylladb-users@googlegroups.com, Dor Laor, Avi Kivity
On Tue, 2020-01-21 at 16:47 +0100, Michael wrote:
> On 21.01.20 at 16:35, Botond Dénes wrote:
> > I'm somewhat confused, due to the ambiguous term `key`, which can be
> > a partition key, a clustering key, or the `key` column in your schema
> > (the partition key). In an earlier email you mentioned that the
> > missing records are rows. So the query you executed looked something
> > like this:
>
> Yes, I should be more exact here.
>
> The column 'key' is the partition key.
>
> The column 'id' is the clustering key.
>
> > select * from errorness_table where key = ? and id = ?;
>
> It was:
>
> select * from errorness_table where key = ?;


So this is the query that returned no results at all when executed
against each of the nodes with CL=LOCAL_ONE?

Michael <micha-1@fantasymail.de>
Jan 22, 2020, 2:15:33 AM
to scylladb-users@googlegroups.com, Botond Dénes, Dor Laor, Avi Kivity

On 21.01.20 at 17:16, Botond Dénes wrote:
>
>> It was:
>>
>> select * from errorness_table where key = ?;
>
> So this is the query that returned no results at all when executed
> against each of the nodes with CL=LOCAL_ONE?
>

Yes, exactly.


Botond Dénes <bdenes@scylladb.com>
Jan 22, 2020, 4:41:57 AM
to micha-1@fantasymail.de, scylladb-users@googlegroups.com, Dor Laor, Avi Kivity
Ok, so when records are missing, entire partitions are not showing up?
Or was this just one example, and in other examples only certain rows
were missing?

The reason I keep banging on this is that the underlying reason could
be very different depending on whether entire partitions or just
certain rows are missing.

Michael <micha-1@fantasymail.de>
Jan 22, 2020, 4:53:35 AM
to Botond Dénes, scylladb-users@googlegroups.com, Dor Laor, Avi Kivity

On 22.01.20 at 10:41, Botond Dénes wrote:
> The reason I keep banging on this is that the underlying reason could
> be very different depending on whether entire partitions or just
> certain rows are missing.


OK, then I should not make any error here.

How do I check for sure that it is not an entire partition that is
missing, but only some rows of a partition?




Botond Dénes <bdenes@scylladb.com>
Jan 22, 2020, 5:04:23 AM
to micha-1@fantasymail.de, scylladb-users@googlegroups.com, Dor Laor, Avi Kivity
Well, if queries like:

select * from errorness_table where key = ? and id = ?;

either turn up with all or none of the data they are supposed to have,
then entire partitions are missing.

If there is also a middle ground, i.e. sometimes only *parts* of the
data are present - some but not all of the rows - then it is possibly the
missing-rows case.

Also, just to sum up what we know so far:
* Cluster of three nodes, scylla 3.1.1-0.20191024.f32ec885c, upgraded
from 3.x.
* Schema:
create table errorness_table (
    key blob,
    id blob,
    attr map<text, frozen<set<text>>>,
    bin1 blob,
    bin2 blob,
    bin3 blob,
    bin4 blob,
    ts timestamp,
    txt1 text,
    txt2 text,
    val1 bigint,
    val2 bigint,
    primary key (key, id)
) with compaction = { 'class': 'SizeTieredCompactionStrategy' }
and compression = { 'chunk_length_in_kb': '16',
    'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor' };
* No deletes, no TTL.
* Reads are like this: select * from errorness_table where key = ?;
with CL=TWO or CL=QUORUM (equivalent, as there are 3 nodes with RF=3).


One point which is not clear to me yet. Did you have all the data prior
to the upgrade? You mentioned importing the data. Did you import all
the data prior to the upgrade, or after? Or did you import only parts of
it? How?

Michael <micha-1@fantasymail.de>
Jan 22, 2020, 7:04:48 AM
to scylladb-users@googlegroups.com, Botond Dénes, Dor Laor, Avi Kivity

On 22.01.20 at 11:04, Botond Dénes wrote:
> On Wed, 2020-01-22 at 10:53 +0100, Michael wrote:
>> OK, then I should not make any error here.
>> How do I check for sure that it is not an entire partition that is
>> missing, but only some rows of a partition?
>>
> Well, if queries like:
>
> select * from errorness_table where key = ? and id = ?;
>
> either turn up with all or none of the data they are supposed to have,
> then entire partitions are missing.
>
> If there is also a middle ground, i.e. sometimes only *parts* of the
> data are present - some but not all of the rows - then it is possibly the
> missing-rows case.

I checked those 9 keys which have a duplicate partition key (so there
should be 9 pairs of rows).

For each of these, either both rows are there (I found 4 pairs) or
neither of them is.

The other rows have a unique partition key, so it seems whole partitions
are missing.

> Also, just to sum up what we know so far:
> * Cluster of three nodes, scylla 3.1.1-0.20191024.f32ec885c, upgraded

It's an *eight* node cluster.

> One point which is not clear to me yet. Did you have all the data prior
> to the upgrade? You mentioned importing the data. Did you import all
> the data prior to the upgrade, or after? Or did you import only parts
> of it? How?
>
Yes, we had imported all the data using version 3.0.x, and no missing
data was reported.

After the upgrade to 3.1, rows started to go missing. We imported fresh
data, I think three or four times: into a new table, into another
keyspace and so on.

Missing records start to be noticed after about 3 billion records have
been inserted.




Avi Kivity <avi@scylladb.com>
Jan 22, 2020, 7:11:53 AM
to micha-1@fantasymail.de, scylladb-users@googlegroups.com, Botond Dénes, Dor Laor
What is your import procedure? Copying sstables, sstableloader,
something else?


Botond Dénes <bdenes@scylladb.com>
Jan 22, 2020, 7:44:55 AM
to micha-1@fantasymail.de, scylladb-users@googlegroups.com, Dor Laor, Avi Kivity
Can you try to see if missing partitions turn up in a full scan? To
avoid scanning all partitions (I understand you have billions of them),
let's try with a subrange of the ring. Something like:

* Find out the token of a partition that is missing:
```
SELECT token(key) FROM errorness_table WHERE key = ?;
```

* Add/subtract 1000 to/from this token and scan that range. As (AFAIK)
arithmetic is not allowed in WHERE clauses, we need to do this
manually:

T1 = token(key) - 1000
T2 = token(key) + 1000

```
SELECT * FROM errorness_table WHERE token(key) > T1 AND token(key) < T2;
```

Note that depending on how many total partitions you have and on luck,
this query could cover none, few or many partitions. You can use a
smaller number if this covers too many partitions.

Please also use QUORUM for the scan.
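For example, assuming the token of a missing key came back as
-4713558060859202229 (a made-up value), the scan would be:

```
SELECT * FROM errorness_table
 WHERE token(key) > -4713558060859203229
   AND token(key) < -4713558060859201229;
```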

Michael <micha-1@fantasymail.de>
Jan 22, 2020, 8:33:00 AM
to scylladb-users@googlegroups.com, Botond Dénes, Dor Laor, Avi Kivity

On 22.01.20 at 13:44, Botond Dénes wrote:
> On Wed, 2020-01-22 at 13:04 +0100, Michael wrote:
>>
>> The other rows have a unique partition key, so it seems whole
>> partitions are missing.
>
> Can you try to see if missing partitions turn up in a full scan? To
> avoid scanning all partitions (I understand you have billions of them),
> let's try with a subrange of the ring. Something like:
>
> * Find out the token of a partition that is missing:
> ```
> SELECT token(key) FROM errorness_table WHERE key = ?;
> ```

Yes, but this only works with existing partitions; otherwise no rows are
returned.




Botond Dénes <bdenes@scylladb.com>
Jan 22, 2020, 10:09:20 AM
to micha-1@fantasymail.de, scylladb-users@googlegroups.com, Dor Laor, Avi Kivity
That's too bad, the partition itself is not needed to calculate the
token.

As a workaround, you can create a dummy keyspace and table, with the
same schema (you don't even need to have all the value columns), insert
the key you wish to query and then do the token select. The dummy
keyspace and table doesn't even have to be created on the production
cluster, you can use a dev cluster for this. The value of the token
depends only on the partitioner (configured per cluster), the schema
and the partition key value.
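A sketch of that workaround (the keyspace/table names are placeholders;
only the partition key column and its type matter for the token):

```
CREATE KEYSPACE IF NOT EXISTS dummy
    WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': 1 };
CREATE TABLE IF NOT EXISTS dummy.keys (key blob PRIMARY KEY);
INSERT INTO dummy.keys (key) VALUES (0x0123);          -- the missing key
SELECT token(key) FROM dummy.keys WHERE key = 0x0123;
```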

What I want to see is whether the partition is genuinely missing or it's
only point queries that have problems finding it. The fact that
partitions re-appear from time to time, as you mentioned earlier, seems
to suggest the latter. BTW, when partitions do re-appear, does that
happen after some modification to the partitions (inserting or
overwriting rows)?

Michael <micha-1@fantasymail.de>
Jan 23, 2020, 3:53:31 AM
to scylladb-users@googlegroups.com, Botond Dénes, Dor Laor, Avi Kivity

On 22.01.20 at 16:09, Botond Dénes wrote:
> On Wed, 2020-01-22 at 14:32 +0100, Michael wrote:
>
> As a workaround, you can create a dummy keyspace and table, with the
> same schema (you don't even need to have all the value columns), insert
> the key you wish to query and then do the token select.

Nice, that works.

> What I want to see is whether the partition is genuinely missing or it's
> only point queries that have problems finding it. The fact that
> partitions re-appear from time to time, as you mentioned earlier, seems
> to suggest the latter.

I managed to get a result (5 rows) when using token - 100,000,000,000
and token + 10,000,000,000, but the missing row was not in the result set.


> BTW, when partitions do re-appear, does that happen after
> some modification to the partitions (inserting or overwriting rows)?

We do not overwrite or delete records.

To clarify the re-appearing effect: this was detected only by accident.
A record was reported to be missing which had been found in an earlier test.
I checked it and indeed didn't find it (the import was still running). Later
I wanted to show it to a colleague, and it was there, so it had reappeared. I
have no way to find more such records. The main effect is that records which
have been imported disappear and stay gone.



Avi Kivity <avi@scylladb.com>
Jan 23, 2020, 4:02:33 AM
to micha-1@fantasymail.de, scylladb-users@googlegroups.com, Botond Dénes, Dor Laor
Can you detail how you perform the import?


Michael <micha-1@fantasymail.de>
Jan 23, 2020, 4:39:15 AM
to scylladb-users@googlegroups.com, Avi Kivity, Botond Dénes, Dor Laor

On 23.01.20 at 10:02, Avi Kivity wrote:
>
>
> Can you detail how you perform the import?
>
>

A Java program reads JSON data and pushes it into a concurrent queue,
from which multiple threads fetch the data, bind a prepared insert
statement and execute it.

In case of an insert error we retry a few times. Every error and
exception is logged.

Avi Kivity <avi@scylladb.com>
Jan 23, 2020, 4:44:53 AM
to micha-1@fantasymail.de, scylladb-users@googlegroups.com, Botond Dénes, Dor Laor
I see. I was looking to see whether you were using an sstable-based
import process, which has many pitfalls. Although you did mention it was
Java based in the original post.


Do you have anything reported in Scylla logs?


Do you have other partitions at the same token as the missing partitions?


Are the tokens of the missing partitions at vnode boundaries?


Michael <micha-1@fantasymail.de>
Jan 23, 2020, 7:48:06 AM
to scylladb-users@googlegroups.com, Avi Kivity, Botond Dénes, Dor Laor

On 23.01.20 at 10:44, Avi Kivity wrote:
>
> I see. I was looking to see whether you were using an sstable-based
> import process, which has many pitfalls. Although you did mention it
> was Java based in the original post.
>
>
> Do you have anything reported in Scylla logs?
>

We watched the log during the import, but no errors or exceptions were shown.


> Do you have other partitions at the same token as the missing partitions?
>

We only know of 9 partitions which hold two rows. Either both rows are
there or neither is.


> Are the tokens of the missing partitions at vnode boundaries?
>

That means I have to check the tokens of the missing partitions against
the token range info I get from nodetool describering, to see whether
these tokens are either the lower or the upper boundary of a range?




Avi Kivity <avi@scylladb.com>
Jan 23, 2020, 8:09:05 AM
to micha-1@fantasymail.de, scylladb-users@googlegroups.com, Botond Dénes, Dor Laor

On 23/01/2020 14.48, Michael wrote:
> On 23.01.20 at 10:44, Avi Kivity wrote:
>> I see. I was looking to see whether you were using an sstable-based
>> import process, which has many pitfalls. Although you did mention it
>> was Java based in the original post.
>>
>>
>> Do you have anything reported in Scylla logs?
>>
> We watched the log during the import, but no errors or exceptions were shown.
>
>
>> Do you have other partitions at the same token as the missing partitions?
>>
> We only know of 9 partitions which hold two rows. Either both rows are
> there or neither is.


Not quite what I meant. Suppose key 'key1' is missing (doesn't matter if
it was supposed to have one row or two). Calculate its token (using
Botond's method), call it t, and run


  SELECT * FROM tab WHERE token(key) = t


This will tell us whether we had a different key that happened to hit
the same token.


This is completely legal and shouldn't cause any problems; I'm just
looking for hints.


Another thing to try is to restart all nodes in the cluster (in order to
clear their caches) and run a select for that key from cqlsh with
TRACING ON. This may provide more information.


>
>> Are the tokens of the missing partitions at vnode boundaries?
>>
> That means I have to check the tokens of the missing partitions against
> the token range info I get from nodetool describering, to see whether
> these tokens are either the lower or the upper boundary of a range?
>
>

Yes, or off-by-one from those boundaries.


Again, no concrete known problem, just looking for hints.
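For reference, one way to do that check (the keyspace name is a
placeholder; describering prints a start_token/end_token pair per vnode
range):

```
$ nodetool describering ks | grep -o 'start_token:[-0-9]*' | sort -u > boundaries.txt
# then look up each missing partition's token t, and also t-1 and t+1,
# in boundaries.txt
```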


Michael <micha-1@fantasymail.de>
Jan 23, 2020, 9:29:22 AM
to scylladb-users@googlegroups.com, Avi Kivity, Botond Dénes, Dor Laor

On 23.01.20 at 14:09, Avi Kivity wrote:
>
> On 23/01/2020 14.48, Michael wrote:
>> We only know of 9 partitions which hold two rows. Either both rows are
>> there or neither is.
>
>
> Not quite what I meant. Suppose key 'key1' is missing (doesn't matter
> if it was supposed to have one row or two). Calculate its token (using
> Botond's method), call it t, and run
>
>
>   SELECT * FROM tab WHERE token(key) = t
>
>

I tried a few but did not find a duplicate. Testing more takes some time.


>
>
> Another thing to try is to restart all nodes in the cluster (in order
> to clear their caches) and run a select for that key from cqlsh with
> TRACING ON. This may provide more information.
>
The nodes were restarted yesterday.

Tracing doesn't work, as there are two columns missing in the system
tables. There is an issue (#4601) about this. I'm not sure if it's OK to
just add those columns or if other dependent modifications are
necessary.


>
>
> Yes, or off-by-one from those boundaries.
>
> Again, no concrete known problem, just looking for hints.
>
>

I tried a few, but none of those tokens were on a vnode boundary, or
off-by-one from one.




Avi Kivity <avi@scylladb.com>
Jan 23, 2020, 10:13:28 AM
to micha-1@fantasymail.de, scylladb-users@googlegroups.com, Botond Dénes, Dor Laor

On 23/01/2020 16.29, Michael wrote:
> On 23.01.20 at 14:09, Avi Kivity wrote:
>> On 23/01/2020 14.48, Michael wrote:
>>> We only know of 9 partitions which hold two rows. Either both rows are
>>> there or neither is.
>>
>> Not quite what I meant. Suppose key 'key1' is missing (doesn't matter
>> if it was supposed to have one row or two). Calculate its token (using
>> Botond's method), call it t, and run
>>
>>
>>   SELECT * FROM tab WHERE token(key) = t
>>
>>
> I tried a few but did not find a duplicate. Testing more takes some time.
>
>
>>
>> Another thing to try is to restart all nodes in the cluster (in order
>> to clear their caches) and run a select for that key from cqlsh with
>> TRACING ON. This may provide more information.
>>
> The nodes were restarted yesterday.
>
> Tracing doesn't work, as there are two columns missing in the system
> tables. There is an issue (#4601) about this. I'm not sure if it's OK to
> just add those columns or if other dependent modifications are
> necessary.


You can add the columns with ALTER TABLE.
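Something along these lines, with the actual table and column
definitions taken from issue #4601 (the column name below is only a
placeholder, not the real missing column):

```
ALTER TABLE system_traces.sessions ADD some_missing_column text;
```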


>
>> Yes, or off-by-one from those boundaries.
>>
>> Again, no concrete known problem, just looking for hints.
>>
> I tried a few, but none of those tokens were on a vnode boundary, or
> off-by-one from one.
>
>
>

Ok, let's see what the tracing results are.


Can you share the tokens for missing partitions, along with the number
of shards per node? I'll see if they match a shard boundary.

Michael <micha-1@fantasymail.de>
Jan 23, 2020, 11:10:52 AM
to scylladb-users@googlegroups.com, Avi Kivity, Botond Dénes, Dor Laor

On 23.01.20 at 16:13, Avi Kivity wrote:
>
> On 23/01/2020 16.29, Michael wrote:
>> On 23.01.20 at 14:09, Avi Kivity wrote:
>>
>>> Another thing to try is to restart all nodes in the cluster (in order
>>> to clear their caches) and run a select for that key from cqlsh with
>>> TRACING ON. This may provide more information.
>>>
>>
Here is the tracing of a select for a missing partition (more if you want):


cqlsh> select key from errorness_table where key =
0x3e6c39ce121da0d99a38e4f1c85cc09c;

 key
-----

(0 rows)

Tracing session: eef25710-3df6-11ea-b139-000000000004

activity | timestamp | source | source_elapsed | client
---------+-----------+--------+----------------+--------
Execute CQL3 query | 2020-01-23 15:42:16.833000 | 192.168.0.15 | 0 | 127.0.0.1
Parsing a statement [shard 0] | 2020-01-23 15:42:16.833733 | 192.168.0.15 | 1 | 127.0.0.1
Processing a statement [shard 0] | 2020-01-23 15:42:16.833837 | 192.168.0.15 | 104 | 127.0.0.1
Creating read executor for token 2773306311014603495 with all: {192.168.0.8, 192.168.0.9, 192.168.0.13} targets: {192.168.0.8, 192.168.0.9} repair decision: NONE [shard 0] | 2020-01-23 15:42:16.833968 | 192.168.0.15 | 236 | 127.0.0.1
read_digest: sending a message to /192.168.0.9 [shard 0] | 2020-01-23 15:42:16.833984 | 192.168.0.15 | 251 | 127.0.0.1
read_data: sending a message to /192.168.0.8 [shard 0] | 2020-01-23 15:42:16.834015 | 192.168.0.15 | 282 | 127.0.0.1
read_data: message received from /192.168.0.15 [shard 0] | 2020-01-23 15:42:16.834615 | 192.168.0.8 | 22 | 127.0.0.1
read_digest: message received from /192.168.0.15 [shard 0] | 2020-01-23 15:42:16.834664 | 192.168.0.9 | 14 | 127.0.0.1
Start querying the token range that starts with 2773306311014603495 [shard 6] | 2020-01-23 15:42:16.834769 | 192.168.0.9 | 11 | 127.0.0.1
Start querying the token range that starts with 2773306311014603495 [shard 6] | 2020-01-23 15:42:16.834789 | 192.168.0.8 | 18 | 127.0.0.1
Querying is done [shard 6] | 2020-01-23 15:42:16.834829 | 192.168.0.9 | 71 | 127.0.0.1
Querying is done [shard 6] | 2020-01-23 15:42:16.834885 | 192.168.0.8 | 115 | 127.0.0.1
read_digest handling is done, sending a response to /192.168.0.15 [shard 0] | 2020-01-23 15:42:16.835066 | 192.168.0.9 | 416 | 127.0.0.1
read_data handling is done, sending a response to /192.168.0.15 [shard 0] | 2020-01-23 15:42:16.835286 | 192.168.0.8 | 692 | 127.0.0.1
read_digest: got response from /192.168.0.9 [shard 0] | 2020-01-23 15:42:16.835507 | 192.168.0.15 | 1774 | 127.0.0.1
read_data: got response from /192.168.0.8 [shard 0] | 2020-01-23 15:42:16.836269 | 192.168.0.15 | 2536 | 127.0.0.1
Done processing - preparing a result [shard 0] | 2020-01-23 15:42:16.836328 | 192.168.0.15 | 2595 | 127.0.0.1
Request complete | 2020-01-23 15:42:16.835612 | 192.168.0.15 | 2612 | 127.0.0.1





>>
>
> Ok, let's see what the tracing results are.
>
>
> Can you share the tokens for missing partitions, along with the number
> of shards per node? I'll see if they match a shard boundary.
>

Some tokens from missing partitions (there are eight shards):
1228594666495504249, 3651180218813750844, 6082364005475862294,
2767066932624749567, 6386387569957585656, 8737956552098038433

Avi Kivity <avi@scylladb.com>
Jan 26, 2020, 11:36:16 AM
to micha-1@fantasymail.de, scylladb-users@googlegroups.com, Botond Dénes, Dor Laor
All seems in order with the trace, and there is nothing suspicious about
the tokens.


Is there a way you can replace the data with a random generator and so
provide a reproducer?


Do you have monitoring set up? If so, please look for errors during the
import phase.


Michael <micha-1@fantasymail.de>
Jan 27, 2020, 3:21:15 AM
to scylladb-users@googlegroups.com, Avi Kivity, Botond Dénes, Dor Laor

On 26.01.20 at 17:36, Avi Kivity wrote:
>
>
>
> All seems in order with the trace, and there is nothing suspicious
> about the tokens.
>
>
> Is there a way you can replace the data with a random generator and so
> provide a reproducer?
>
>

A generator that outputs JSON / insert statements for the table with
random values?

Sure - JSON or inserts?



> Do you have monitoring set up? If so, please look for errors during the
> import phase.
>

We don't have Grafana monitoring. There are no errors in the Scylla logs.





Avi Kivity <avi@scylladb.com>
Jan 27, 2020, 7:04:31 AM
to micha-1@fantasymail.de, scylladb-users@googlegroups.com, Botond Dénes, Dor Laor

On 27/01/2020 10.21, Michael wrote:
> On 26.01.20 at 17:36, Avi Kivity wrote:
>>
>> All seems in order with the trace, and there is nothing suspicious
>> about the tokens.
>>
>> Is there a way you can replace the data with a random generator and so
>> provide a reproducer?
>>
> A generator that outputs JSON / insert statements for the table with
> random values?
>
> Sure - JSON or inserts?


Either would work. A generator that creates input for your importer,
plus your unmodified importer, or your importer changed to create random
data rather than read from disk.


I guess the first is better, because it allows us to create the dataset
just once and then try it again and again, plus it is more similar to
your original workflow.


>
>> Do you have monitoring set up? If so, please look for errors during the
>> import phase.
>>
> We don't have Grafana monitoring. There are no errors in the Scylla logs.
>
>
>
>
>

I recommend setting it up. While I don't have high hopes that it will
help with the current problem, it may help with future problems. Plus
I'm curious about how some metrics look in an HDD environment.


Michael <micha-1@fantasymail.de>
Jan 27, 2020, 11:02:48 AM
to scylladb-users@googlegroups.com, Avi Kivity, Botond Dénes, Dor Laor

On 27.01.20 at 13:04, Avi Kivity wrote:
>
>
>
> I guess the first is better, because it allows us to create the
> dataset just once and then try it again and again, plus it is more
> similar to your original workflow.
>
I have written a Go script (120 lines) which generates data that looks
like our data and outputs JSON Lines format.

What is the preferred way to share it?


> I recommend setting it up. While I don't have high hopes that it will
> help with the current problem, it may help with future problems. Plus
> I'm curious about how some metrics look in an HDD environment.
>

OK, I'll try to get hardware for this.

