linkbench, TokuDB and range scans on non-covering secondary indexes

MARK CALLAGHAN

Aug 17, 2016, 11:30:56 AM

to percona-d...@googlegroups.com

I am using TokuDB from Percona Server 5.6.26-74.

I am sharing this because one day I might report perf results for some of the tests I do. I don't need a fix or an answer for this problem. But I don't want to surprise you with this result.

I was curious about the amount of extra space used to make the linktable secondary index covering for the most frequent query in linkbench. I ran the test for MyRocks and InnoDB without problems. Making the id1_type index non-covering saves space but reduces TPS for linkbench by about 20%.

My results with TokuDB were much worse. In one configuration I get more than 10,000 QPS with the covering secondary index and approximately 0 TPS with the non-covering secondary index. The DDL is below for the linktable. TPS drops to zero because the short range scan on linktable takes 20+ minutes when the secondary index is non-covering. This paste has an example query, explain output, PMP output and Linux perf output -- https://gist.github.com/mdcallag/4b3049933a345864b2943186a1c357a0

The query plan looks OK. My PMP reading skills for TokuDB are not great. It looks like there is too much contention on frwlock but maybe that isn't the only problem. Even when I only have one client running one copy of the problem query, the problem query still takes many minutes, maybe 10 or 20, to finish.

For this test I used maxid1=1B and the test table is ~300G with TokuDB. The my.cnf for TokuDB was https://gist.github.com/mdcallag/90b5ca45bfcbe48970757b84c9e0d983

The standard DDL is:

CREATE TABLE `linktable` (
`id1` bigint(20) unsigned NOT NULL DEFAULT '0',
`id1_type` int(10) unsigned NOT NULL DEFAULT '0',
`id2` bigint(20) unsigned NOT NULL DEFAULT '0',
`id2_type` int(10) unsigned NOT NULL DEFAULT '0',
`link_type` bigint(20) unsigned NOT NULL DEFAULT '0',
`visibility` tinyint(3) NOT NULL DEFAULT '0',
`data` varchar(255) NOT NULL DEFAULT '',
`time` bigint(20) unsigned NOT NULL DEFAULT '0',
`version` int(11) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (link_type, `id1`,`id2`) COMMENT 'cf_link_pk',
KEY `id1_type` (`id1`,`link_type`,`visibility`,`time`,`id2`,`version`,`data`)
) ENGINE=TokuDB;

The non-covering version of id1_type is:

CREATE TABLE `linktable` (
`id1` bigint(20) unsigned NOT NULL DEFAULT '0',
`id1_type` int(10) unsigned NOT NULL DEFAULT '0',
`id2` bigint(20) unsigned NOT NULL DEFAULT '0',
`id2_type` int(10) unsigned NOT NULL DEFAULT '0',
`link_type` bigint(20) unsigned NOT NULL DEFAULT '0',
`visibility` tinyint(3) NOT NULL DEFAULT '0',
`data` varchar(255) NOT NULL DEFAULT '',
`time` bigint(20) unsigned NOT NULL DEFAULT '0',
`version` int(11) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (link_type, `id1`,`id2`) COMMENT 'cf_link_pk',
KEY `id1_type` (`id1`,`link_type`,`visibility`,`time`)
) ENGINE=TokuDB;
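
To make the difference between the two definitions concrete, here is a minimal sketch that uses SQLite (from the Python standard library) as a stand-in, since it labels covering-index plans explicitly. Everything in it is an assumption for illustration: the column types are simplified, the query is only an approximation of the linkbench link-list query (the real one is in the gist above), `plan_for` is a hypothetical helper, and `INDEXED BY` is SQLite syntax used to pin the plan to `id1_type`. It says nothing about how TokuDB executes the query; it only shows that the short index forces a lookup into the base table for `id2`, `version` and `data`.

```python
import sqlite3

# Simplified column list; types are reduced to what SQLite needs.
COLUMNS = """
    id1 INTEGER NOT NULL DEFAULT 0,
    id1_type INTEGER NOT NULL DEFAULT 0,
    id2 INTEGER NOT NULL DEFAULT 0,
    id2_type INTEGER NOT NULL DEFAULT 0,
    link_type INTEGER NOT NULL DEFAULT 0,
    visibility INTEGER NOT NULL DEFAULT 0,
    data TEXT NOT NULL DEFAULT '',
    time INTEGER NOT NULL DEFAULT 0,
    version INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (link_type, id1, id2)
"""

# Assumed approximation of the short range scan on linktable.
QUERY = """
    SELECT id1, id2, link_type, visibility, data, time, version
    FROM linktable INDEXED BY id1_type
    WHERE id1 = ? AND link_type = ? AND visibility = ?
      AND time >= ? AND time <= ?
    ORDER BY time DESC LIMIT 10
"""


def plan_for(index_columns):
    """Create linktable with the given id1_type definition and return the plan text."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE linktable ({COLUMNS})")
    conn.execute(f"CREATE INDEX id1_type ON linktable ({index_columns})")
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (1, 2, 1, 0, 100)).fetchall()
    conn.close()
    # The last column of each plan row is the human-readable detail string.
    return " | ".join(row[-1] for row in rows)


if __name__ == "__main__":
    print(plan_for("id1, link_type, visibility, time, id2, version, data"))
    print(plan_for("id1, link_type, visibility, time"))
```

With the long definition the plan text mentions a covering index; with the short one it is a plain index search, so every matching entry costs an extra read of the row itself.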

--

Mark Callaghan
mdca...@gmail.com

George Lorch

Aug 17, 2016, 1:41:36 PM

to Percona Discussion

Hey Mark,

A few things you're probably seeing here:

First, yes, TokuDB has poor concurrency on single tables/indices, specifically at the node frwlock. We have discussed this some before (https://groups.google.com/d/msg/percona-discussion/iDEymBKnjdw/YItXpUzOAAAJ). Concentrating writes via messaging and using huge nodes, the very goals of the design, are also exactly what you would do if you set out to achieve poor concurrency. Every FT node has a single writer-priority lock covering the entire node, which is far too coarse for a node that is 4M in size. What is rather ironic is that the 'f' in frwlock is supposed to mean 'fair', but it is not fair at all. We are trying to pry apart some of the layers in there and create a more granular node locking scheme, but it is quite difficult as the code is...ahem...shall we say, a bit obtuse and there is a strong ripple effect when touching the internal FT locking.
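
As a rough illustration of why a writer-priority lock is not fair to readers, here is a toy, single-threaded scheduling model. It is not the frwlock implementation and makes strong simplifying assumptions: every request holds the lock for exactly one tick, only one request holds the lock at a time (readers are not shared), and `reader_start_tick` is a hypothetical helper invented for this sketch.

```python
def reader_start_tick(arrivals, policy):
    """Return the tick at which the first reader is granted the lock.

    arrivals: list of (tick, kind) with kind "R" (reader) or "W" (writer).
    policy:   "writer_priority" grants any waiting writer before a reader;
              "fifo" grants requests strictly in arrival order.
    """
    queue = sorted(arrivals)
    pending = []
    next_arrival = 0
    tick = 0
    while next_arrival < len(queue) or pending:
        # Admit every request that has arrived by this tick.
        while next_arrival < len(queue) and queue[next_arrival][0] <= tick:
            pending.append(queue[next_arrival])
            next_arrival += 1
        if pending:
            chosen = pending[0]
            if policy == "writer_priority":
                writers = [req for req in pending if req[1] == "W"]
                if writers:
                    chosen = writers[0]
            pending.remove(chosen)
            if chosen[1] == "R":
                return tick
        tick += 1
    return None


if __name__ == "__main__":
    # One reader at tick 0, plus one new writer arriving on each of ticks 0..9.
    workload = [(0, "R")] + [(t, "W") for t in range(10)]
    print(reader_start_tick(workload, "fifo"))
    print(reader_start_tick(workload, "writer_priority"))
```

In this model the reader is served immediately under FIFO, but under writer priority it waits until the stream of writers dries up, which is the starvation pattern a coarse per-node lock can amplify.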

Second, that specific query is possibly hitting this https://tokutek.atlassian.net/browse/DB-233 and this https://tokutek.atlassian.net/browse/DB-534 and this https://tokutek.atlassian.net/browse/DB-988. They are all the same issue: basically, a descending scan with ICP (index condition pushdown) causes the reverse scan to blow through the lower range limit and keep scanning until it hits the beginning of the index. This is fixed in PS 5.6.30-76.3.
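
The effect described above can be sketched with a toy model of a reverse range scan over a sorted index. This is not TokuDB or MySQL code; the function name, the `stop_at_low` flag and the data are all made up for illustration. The only point is the difference in entries examined when the scan does or does not stop at the lower bound.

```python
from bisect import bisect_right


def reverse_range_scan(sorted_keys, low, high, limit, stop_at_low=True):
    """Scan keys in descending order from `high`; return (matches, entries_examined).

    With stop_at_low=False the scan models the buggy behavior: keys below
    `low` are filtered out but the scan keeps walking toward the start.
    """
    matches = []
    examined = 0
    pos = bisect_right(sorted_keys, high) - 1  # last key <= high
    while pos >= 0 and len(matches) < limit:
        key = sorted_keys[pos]
        examined += 1
        if key < low:
            if stop_at_low:
                break  # correct behavior: lower range limit reached
        else:
            matches.append(key)
        pos -= 1
    return matches, examined


if __name__ == "__main__":
    keys = list(range(100_000))
    # Only 10 keys fall in the range, fewer than the LIMIT of 20.
    print(reverse_range_scan(keys, 99_990, 99_999, 20, stop_at_low=True)[1])
    print(reverse_range_scan(keys, 99_990, 99_999, 20, stop_at_low=False)[1])
```

Both variants return the same rows, but the second one touches every entry in the index before it gives up, which would be consistent with a short range scan that takes minutes on a ~300G table.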

A possible third thing, which we have also discussed before, is the linear list search down in the FT map file block allocator (https://tokutek.atlassian.net/browse/FT-724). It is addressed in the upcoming PS 5.6.32-78.0 release, and there will be an interesting blog post on MPB describing the issue in detail. This really kills the checkpointer, which then takes the rest of the engine down with it. We know that smaller node sizes are ideal for fast storage and for single-index concurrency issues, but with huge data sets they really aggravate this block allocator issue.
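
To give a feel for why a linear search in a block allocator hurts as the number of blocks grows, here is a toy first-fit allocator over a sorted list of allocated blocks. It is only a guess at the general shape of the problem, not the PerconaFT allocator; `first_fit_linear` and the block layout are hypothetical.

```python
def first_fit_linear(blocks, size):
    """Find the first gap of at least `size` between allocated blocks.

    blocks: list of (offset, length) tuples sorted by offset.
    Returns (offset_for_new_block, gaps_inspected). If no gap fits,
    the new block goes at the end of the file.
    """
    if not blocks:
        return 0, 0
    steps = 0
    for (off, length), (next_off, _) in zip(blocks, blocks[1:]):
        steps += 1
        gap_start = off + length
        if next_off - gap_start >= size:
            return gap_start, steps
    last_off, last_len = blocks[-1]
    return last_off + last_len, steps


if __name__ == "__main__":
    # 100,000 blocks of length 4 with a 1-unit hole after each: a fragmented file.
    fragmented = [(i * 5, 4) for i in range(100_000)]
    print(first_fit_linear(fragmented, 1))  # fits in the very first hole
    print(first_fit_linear(fragmented, 2))  # no hole fits; scans every gap
```

Smaller nodes mean more blocks for the same amount of data, so in a model like this each allocation that does not fit an early hole walks a proportionally longer list.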

If you want to try to prove/disprove the second point, remove the descending index scans and see if there is any change in the result (or upgrade to 5.6.30-76.3). Ideally, wait a week or so for PS 5.6.32-78.0 to drop, or just do a PS 5.6 trunk build if you are feeling adventurous, and retry your test.

--
George O. Lorch III
Software Engineer, Percona
US/Arizona (GMT -7)

Peter Zaitsev

Aug 17, 2016, 9:33:12 PM

to percona-discussion

Hi Mark,

Thanks for testing TokuDB and helping us spot issues :)

--

Peter Zaitsev, CEO, Percona
Tel: +1 888 401 3401 ext 7360 Skype: peter_zaitsev
