Tombstones & partition-deletes


hor...@gmail.com

<horschi@gmail.com>
Dec 7, 2020, 12:46:17 PM
to ScyllaDB users
Hi,

Today I had a partition in Scylla with quite a lot of tombstones, where lots of individual keys had been deleted.

My plan to clean this mess up was to manually write a partition-delete and then run a major compaction, to get rid of the single-key deletes. But this did not seem to work; the query was still quite slow. Unfortunately it's hard to say for sure, because https://github.com/scylladb/scylla/issues/3632 is still open.

Shouldn't the partition-delete make the individual key-deletes obsolete? Couldn't they be dropped upon compaction?

regards,
Christian

Avi Kivity

<avi@scylladb.com>
Dec 7, 2020, 1:08:39 PM
to scylladb-users@googlegroups.com, hor...@gmail.com
Yes. You also need a "nodetool flush" to force the partition tombstone
to disk, in order to get it compacted with the row tombstones.


https://github.com/scylladb/scylla/pull/7690 tries to fail gracefully
(or less badly) in such cases. It will likely arrive in 4.4.


We also plan to support analytics queries on such data, where latency is
not important but having the query succeed is.
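
The flush-then-compact point above can be pictured with a toy model. This is only a sketch under stated assumptions, not Scylla's compaction code: `RowTombstone` and `compact` are hypothetical names, timestamps are plain numbers, and the only rule modeled is that a partition tombstone shadows row tombstones with an equal or older timestamp once it is on disk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalLong;

public class TombstoneShadowingSketch {

    /** Hypothetical stand-in for a row tombstone: clustering key plus write timestamp. */
    static final class RowTombstone {
        final String ck;
        final long timestamp;

        RowTombstone(String ck, long timestamp) {
            this.ck = ck;
            this.timestamp = timestamp;
        }
    }

    /**
     * Toy compaction over on-disk data only. A partition tombstone that is still
     * in the memtable (empty OptionalLong) is invisible here, so nothing is dropped.
     * Once flushed, it shadows every row tombstone with an equal or older timestamp.
     */
    static List<RowTombstone> compact(List<RowTombstone> onDisk,
                                      OptionalLong flushedPartitionTombstone) {
        if (!flushedPartitionTombstone.isPresent()) {
            return new ArrayList<>(onDisk);
        }
        long partitionTs = flushedPartitionTombstone.getAsLong();
        List<RowTombstone> survivors = new ArrayList<>();
        for (RowTombstone t : onDisk) {
            if (t.timestamp > partitionTs) {
                survivors.add(t);
            }
        }
        return survivors;
    }

    public static void main(String[] args) {
        List<RowTombstone> rows = List.of(
                new RowTombstone("a", 100L),
                new RowTombstone("b", 200L),
                new RowTombstone("c", 300L));

        // Partition delete not flushed yet: compaction keeps all row tombstones.
        System.out.println(compact(rows, OptionalLong.empty()).size()); // prints 3

        // Partition delete flushed with a newer timestamp: all row tombstones are dropped.
        System.out.println(compact(rows, OptionalLong.of(400L)).size()); // prints 0
    }
}
```

In this model the row tombstones can only disappear from the compaction output if the partition tombstone takes part in the same compaction, which is why the flush comes first.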

horschi

<horschi@gmail.com>
Dec 8, 2020, 12:37:10 PM
to ScyllaDB users
To me it looks like the tombstones are not being removed: I went through the following steps and still see read times of 0.5 seconds on that empty partition:
- Manually write a row delete (that should supersede the 1 million tombstones)
- Flush and wait for it to settle
- Compact
- Query takes 0.5 seconds (before it was 4 seconds)

If tracing showed the number of tombstones, I could tell for sure. But I think the 0.5 seconds is due to the tombstones still being there. Is there any chance #3632 could be implemented soon?


Avi Kivity

<avi@scylladb.com>
Dec 13, 2020, 4:54:24 AM
to scylladb-users@googlegroups.com, horschi

It may be that the tombstones are left in cache (#2252). You can verify by restarting the nodes, or querying with BYPASS CACHE.
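
One way to run the BYPASS CACHE check is to time the same read with and without the cache and compare the two numbers. The helper below is a minimal sketch: `timeMillis` and `cachedReadLooksSlow` are hypothetical names, the factor of 10 is an arbitrary threshold, and the two reads are stubbed with sleeps so the example runs without a cluster. In a real check, the two `Runnable`s would wrap the driver calls for `select ... where pk='1'` and the same query with `BYPASS CACHE` appended.

```java
public class CacheBypassTimingSketch {

    /** Runs the given read once and returns the elapsed wall-clock time in milliseconds. */
    static long timeMillis(Runnable read) {
        long start = System.nanoTime();
        read.run();
        return (System.nanoTime() - start) / 1_000_000L;
    }

    /**
     * Returns true if the cached read took at least `factor` times as long as
     * the bypass read. A bypass time of 0 ms is treated as 1 ms.
     */
    static boolean cachedReadLooksSlow(long cachedMs, long bypassMs, long factor) {
        return cachedMs >= Math.max(bypassMs, 1L) * factor;
    }

    private static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Stubs standing in for the real queries (assumption: no cluster available here).
        Runnable cachedRead = () -> sleep(50);
        Runnable bypassRead = () -> sleep(1);

        long cachedMs = timeMillis(cachedRead);
        long bypassMs = timeMillis(bypassRead);
        System.out.println("cached=" + cachedMs + " ms, bypass=" + bypassMs
                + " ms, suspicious=" + cachedReadLooksSlow(cachedMs, bypassMs, 10));
    }
}
```

If the cached read is consistently much slower than the bypass read on a partition that should be empty, that points at tombstones lingering in the cache rather than on disk.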


wrt. #3632, I can't promise anything. You can take a stab at it yourself, if you like.

--
You received this message because you are subscribed to the Google Groups "ScyllaDB users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scylladb-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/scylladb-users/CALY91SNtbXL-nWiNOE2mSFvGmT3AiVXczZgc6z0BEN1v3kscvQ%40mail.gmail.com.

hor...@gmail.com

<horschi@gmail.com>
Sep 20, 2021, 9:19:50 AM
to ScyllaDB users
Hi Avi,

Are there any plans to resolve the tombstone-related tickets from this thread?

I was just testing again, and I can easily produce a situation where I have to restart my scylla-server to get rid of my tombstones:

for (i = 0; i < 1000000; i++)
{
    ck = rand();
    update test set val='val0' where pk='1' and ck=ck;
    # optional: delete from test where pk='1' and ck=ck;
    delete from test where pk='1';
}

I have the suspicion that the tombstones are not even cleared by gc_grace.
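
For reference, the usual gc_grace rule can be written down in a few lines. This is a sketch of the general rule as I understand it, not of Scylla's implementation: `pastGcGrace` is a hypothetical helper, 864000 seconds (10 days) is the default `gc_grace_seconds`, and a real compaction may apply further conditions before it actually drops a tombstone.

```java
public class GcGraceSketch {

    /** Default gc_grace_seconds: 10 days. */
    static final long DEFAULT_GC_GRACE_SECONDS = 864_000L;

    /**
     * Returns true if a tombstone written at deletionTimeSec is older than
     * gcGraceSec at time nowSec, i.e. it has become a candidate for purging.
     */
    static boolean pastGcGrace(long deletionTimeSec, long gcGraceSec, long nowSec) {
        return deletionTimeSec + gcGraceSec < nowSec;
    }

    public static void main(String[] args) {
        long deletedAt = 1_000_000L; // arbitrary deletion time in seconds

        // One hour after the delete: still inside gc_grace.
        System.out.println(pastGcGrace(deletedAt, DEFAULT_GC_GRACE_SECONDS, deletedAt + 3_600L)); // prints false

        // Just over ten days after the delete: past gc_grace.
        System.out.println(pastGcGrace(deletedAt, DEFAULT_GC_GRACE_SECONDS, deletedAt + 864_001L)); // prints true
    }
}
```

A tombstone that passes this check but still slows down reads would be consistent with the suspicion above.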


Would it help if I provided a Java program reproducing this issue?

regards,
Christian

horschi

<horschi@gmail.com>
Sep 20, 2021, 10:23:01 AM
to ScyllaDB users
So I made a dummy program that reproduces the issue. After running it, I can see that querying with BYPASS CACHE is much faster than querying through the cache:
Reading with bypass cache took 1 ms
Reading took 1101 ms

(Of course this requires flushing/compacting in between runs)



Would this code help in some ticket?

// CREATE TABLE IF NOT EXISTS test
// (
//     pk TEXT,
//     ck TEXT,
//     val TEXT,
//     PRIMARY KEY ((pk), ck)
// );

final Session session = conn.getSession("mykeyspace");
final PreparedStatement statementUpdate = session.prepare("update test set val='some value' where pk='1' And ck=:ck");
final PreparedStatement statementDeleteSingle = session.prepare("delete from test where pk='1' and ck=:ck");
final PreparedStatement statementDeletePartition = session.prepare("delete from test where pk='1'");
final PreparedStatement statementSelect = session.prepare("select * from test where pk='1'");
final PreparedStatement statementSelectBypass = session.prepare("select * from test where pk='1' BYPASS CACHE");

final ResultSet resBefore = session.execute(statementSelect.bind()); // load to cache

// Baseline: 11 timed reads that bypass the cache
for (int i = 0; i <= 10; i++)
{
    final long start = System.currentTimeMillis();
    final ResultSet res = session.execute(statementSelectBypass.bind());
    System.out.println("" + i + " - Reading with bypass cache took " + (System.currentTimeMillis() - start) + " ms");
}


for (int i = 0; i <= 1000; i++)
{
    for (int ii = 0; ii <= 10; ii++)
    {
        // Write 1001 rows with random clustering keys into partition '1' ...
        for (int iii = 0; iii <= 1000; iii++)
        {
            final String k = UUID.randomUUID().toString();
            final BoundStatement boundUpdate = statementUpdate.bind();
            boundUpdate.setString("ck", k);
            session.execute(boundUpdate);

            // final BoundStatement boundDelete = statementDeleteSingle.bind();
            // boundDelete.setString("ck", k);
            // session.execute(boundDelete);
        }
        // ... then delete the whole partition
        session.execute(statementDeletePartition.bind()); // partition delete
    }

    // Timed read through the cache
    final long start = System.currentTimeMillis();
    final ResultSet res = session.execute(statementSelect.bind());
    System.out.println("" + i + " - Reading took " + (System.currentTimeMillis() - start) + " ms");
}
session.execute(statementDeletePartition.bind()); // partition delete

{
    final long start = System.currentTimeMillis();
    final ResultSet res = session.execute(statementSelectBypass.bind());
    System.out.println("Reading with bypass cache took " + (System.currentTimeMillis() - start) + " ms");
}

// Run compaction in between runs:
// nodetool flush 
// nodetool compact mykeyspace test
//
// sstabledump /var/lib/scylla/data/mykeyspace/test-*/*-Data.db



Avi Kivity

<avi@scylladb.com>
Sep 22, 2021, 8:34:43 AM
to scylladb-users@googlegroups.com, hor...@gmail.com

Yes, there are, although we're first clearing the more burning issue of range tombstones.

horschi

<horschi@gmail.com>
Sep 22, 2021, 8:51:29 AM
to Avi Kivity, ScyllaDB users
OK. I assume the tombstone cache expiration (#6033) will not be fixed as part of 4.5?

Avi Kivity

<avi@scylladb.com>
Sep 22, 2021, 9:02:08 AM
to horschi, ScyllaDB users

Unfortunately not.
