question about my.cnf options for TokuDB

MARK CALLAGHAN

May 9, 2016, 10:16:10 AM
to percona-d...@googlegroups.com
Based on a discussion with Vadim I have been using the following config for a few benchmarks -- linkbench and insert benchmark (insert-only and insert with queries). 

My results from TokuDB are frequently not good and I wonder whether the options are to blame. With one exception I have not experimented with changing these because there isn't much description for many of the variables at https://www.percona.com/doc/percona-server/5.6/tokudb/tokudb_variables.html

The one exception is that someone suggested I try 'tokudb_fanout=128'. I did, and it made linkbench much slower.

Are these options good enough? Are there other options I should set? I wonder if tokudb_checkpoint_pool_threads is large enough when I run tests on a many-core server.


tokudb_cache_size               = ...
tokudb_directio                 = OFF
tokudb_empty_scan               = disabled
tokudb_read_block_size          = 16K
tokudb_commit_sync              = OFF
tokudb_checkpointing_period     = 900
tokudb_block_size               = 4M
tokudb_cleaner_iterations       = 10000
tokudb_checkpoint_pool_threads  = 1
tokudb_enable_partial_eviction  = OFF
tokudb_row_format               = tokudb_zlib

--
Mark Callaghan
mdca...@gmail.com

George Lorch

May 9, 2016, 1:11:34 PM
to Percona Discussion
There are a lot of moving parts here, which makes it difficult to offer the 'right' answer for all cases, but I will try to give you a few general guidelines/ideas:
  1. If you have fast storage, both low latency and high throughput:
    1. You might want your tokudb_fanout to be larger than the default of 16. Vadim has seen good results with a fanout of 128; going much larger than that yields diminishing returns. Generally, increasing fanout reduces concurrency on that tree due to PerconaFT node locking issues. Even at a fanout of 16 I have seen a basic sysbench test hit serious concurrency issues with as few as 4 threads on a single table. If you increase the number of trees/tables, the concurrency issue becomes less visible. So adjust this upwards if you have a lot of trees, downwards if you have only a few trees but a lot of threads. Once we address the PerconaFT node locking issue, it will be more practical to increase the default fanout to some fairly high values without risking concurrency issues. I hope to have something available for this later in the year.
    2. You will want to use a smaller tokudb_block_size (node size). Dropping from the default of 4M to 1M helps, but it increases the likelihood that you will run into an issue with the PerconaFT map file block allocation strategy. We have a fix pending for this that should make smaller block sizes more efficient and thus allow a practical block size that is a small multiple of the read block size. For now, 1M seems a good tradeoff for fast storage.
  2. Checkpoint threads:
    1. This one is kind of tricky. It depends on the load you are applying and whether you have CPU cores going unused during your test. If you have cores going unused, then yes, you can safely increase the number of checkpoint threads, but this will only help up to a point. The checkpoint threads are where some (not all) of the PerconaFT block compression and writing occurs during a checkpoint. If you over-allocate these threads, they will bite into the CPU availability of the user/query processing threads. In most test cases we have seen that the default allocation is in fact too high. You can monitor the size of the queued work this thread pool has during a checkpoint by watching the TokuDB status variables mentioned at https://www.percona.com/doc/percona-server/5.6/tokudb/tokudb_troubleshooting.html#global-status (an example query is included at the end of this message), paying particular attention to:
      1. TOKUDB_CACHETABLE_POOL_CHECKPOINT_NUM_THREADS
      2. TOKUDB_CACHETABLE_POOL_CHECKPOINT_NUM_THREADS_ACTIVE
      3. TOKUDB_CACHETABLE_POOL_CHECKPOINT_QUEUE_SIZE
      4. TOKUDB_CACHETABLE_POOL_CHECKPOINT_MAX_QUEUE_SIZE
      5. TOKUDB_CACHETABLE_POOL_CHECKPOINT_TOTAL_ITEMS_PROCESSED
      6. TOKUDB_CACHETABLE_POOL_CHECKPOINT_TOTAL_EXECUTION_TIME
    2. Basically, each 'item' in this queue is a PerconaFT node that has been cloned in memory and sent to this pool to be compressed, serialized and written to disk. The side effect of this is that these clones count against the cachetable memory limit and their existence, even for a short period of time, can force the cachetable evictor to begin aggressively clearing nodes out of the cachetable memory in order to stay below the memory cap.
    3. So right now it is a balancing act to ensure you have enough checkpoint threads to prevent runaway memory utilization during checkpoint, but not so many that they take over all of your CPU and starve user request handling threads.
  3. Checkpointing period:
    1. This is another tricky one. The current default period of 1 minute is pretty short for most systems and doesn't collect enough changes per checkpoint. In theory, if you never want to 'recover', you can go with an infinite checkpoint period, but your shutdown checkpoint and crash recovery will also take infinitely long :) In most of our benchmarks, both big and small, we find value in increasing this up to at least 5-15 minutes, but the trade-off is that when the checkpoint finally does run, it causes what we have come to call a 'clone storm' and usually shows up as a brief instability in performance during the first few seconds of checkpointing, causing user response times to skyrocket and throughput to drop to the floor. We are working on a few different angles of fixing checkpointing so this issue isn't quite as visible and acute.
Hope this helps some. I know it isn't a simple "set this to that and you're good" answer, but unfortunately, this is where TokuDB/PerconaFT is today. It was all originally designed and tuned for high-latency, moderate-throughput storage, so many of the ideas embedded within it start to fall apart once your storage capability starts catching up with your memory capability.
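
For reference, here is a hedged example of how one might watch those checkpoint pool counters from the mysql client during a checkpoint (this assumes the status variable names listed above; exact names and capitalization may vary by server version):

mysql> SHOW GLOBAL STATUS LIKE 'Tokudb_cachetable_pool_checkpoint%';

Comparing TOKUDB_CACHETABLE_POOL_CHECKPOINT_QUEUE_SIZE against its MAX counterpart while a checkpoint is running gives a rough sense of whether the pool is keeping up.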

MARK CALLAGHAN

May 9, 2016, 1:22:35 PM
to percona-d...@googlegroups.com
Thanks for the detailed response. I will run a test with more checkpoint threads to see if that helps. Using tokudb_fanout=128 for my tests produced much worse results in the recent past.

RocksDB has the concept of fanout but I am not sure it is similar to the TokuDB concept of fanout. In RocksDB, were I to increase fanout from 16 to 128, I would increase write-amp by 8X, and that is not desired. Does increasing fanout for TokuDB have the same side effect of increasing write-amp?


--
Mark Callaghan
mdca...@gmail.com

George O. Lorch III

May 9, 2016, 3:31:01 PM
to percona-d...@googlegroups.com
I believe in theory, yes, it would increase write-amp, and for roughly the same reason. The longer checkpointing period, though (checkpoints are when nodes are _actually_ written to disk), helps re-fold this back upon itself and reduce the observable amplification. It again really depends on how the workload is spread across nodes in the tree and how many node writes can be collected into a single checkpoint period. Increasing the fanout is more of a read optimization than a write optimization.

-- 
George O. Lorch III
Software Engineer, Percona
US/Arizona (GMT -7)
skype: george.ormond.lorch.iii

MARK CALLAGHAN

May 11, 2016, 1:42:38 PM
to percona-d...@googlegroups.com
Using my iibench scripts with inserts & queries enabled, changing from the default (tokudb_fanout=16) to tokudb_fanout=32 doubled write-amplification, as predicted, and decreased write throughput. This was expected and is not something I want for this workload.

Changing checkpoint threads from 1 (the default) to 2 increased the peak insert rate by ~8% without any bad side effects. I will continue to use 2 checkpoint threads for future tests.
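
For reference, the my.cnf delta being described would look roughly like this (a sketch; the rest of the config posted earlier in the thread is unchanged):

# was 1, the default
tokudb_checkpoint_pool_threads  = 2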

Vadim Tkachenko

May 11, 2016, 6:20:00 PM
to percona-discussion
Mark,

With checkpoint threads, the point to consider is not only the peak rate but also the variance in throughput during a checkpoint. The more threads you allocate for checkpointing, the more resources are taken from user threads and the bigger the performance drop you may observe. But on the other hand the drop will be shorter, as the checkpoint will finish faster.

Vadim Tkachenko, CTO, Percona
Phone +1-919-249-7483, Skype: vadimtk153

MARK CALLAGHAN

May 11, 2016, 6:43:55 PM
to percona-d...@googlegroups.com
I don't see more variance from this for the insert-only insert benchmark. When I add queries concurrent with rate-limited inserts there is another problem that is so significant that not much else matters. The problem is the per-node locking done when the Toku equivalent of compaction is in progress. I am not sure whether it is a mutex or a RW-lock, but that doesn't matter much. The end result is that both insert and query performance are much worse than I expect. I have been using tokudb_block_size=4M and will repeat a few tests with something smaller.

MARK CALLAGHAN

May 24, 2016, 1:37:20 PM
to percona-d...@googlegroups.com
tokudb_block_size=1M helped for part of the test but hurt other parts. I tried changing tokudb_cleaner_iterations to a smaller value than 10,000 but that made things worse.

The problem I have on the insert benchmark when there are inserts and queries at the same time is that the QPS for queries drops to zero. This reproduces on servers with fast SSD and with a small disk array.
--
Mark Callaghan
mdca...@gmail.com

George Lorch

May 24, 2016, 1:48:18 PM
to Percona Discussion
If it is periodic, it is likely the checkpointing issues in FT; otherwise it could be the internal FT node-locking contention issue. With a single write lock held high in a tree you effectively block all access to everything from that particular node downwards. The FT node locking is entirely too coarse when it comes to write locks, and the huge node sizes (compared to InnoDB) exacerbate the problem, since a single write lock now locks a much more significant portion of the data set.

We've got a few things coming down the pike to address the checkpointing issues and the node lock granularity, among other things, but they are all pretty invasive changes, so proving them out and testing them will take quite some time before they see the real world.


MarkCallaghan

Dec 23, 2016, 12:04:45 PM
to Percona Discussion
I am not asking you to fix this. I am sharing it here so you don't first learn of this when I share results in public.

Upgraded to TokuDB from Percona Server 5.7.15, running the insert benchmark using https://github.com/mdcallag/mytools/blob/master/bench/ibench/iq.sh. There are 3 phases in this test: insert-only, inserts rate-limited to 1000/second with concurrent queries, and inserts rate-limited to 100/second with concurrent queries. The query rates drop to 0 per second during the second and third phases.

This isn't bad query plans from bad index cardinality stats. I checked the results from SHOW INDEXES and SHOW TABLE STATUS and the numbers are good. The QPS drops because the queries do full (or long) index scans when they should be doing short index-only scans on covering secondary indexes. InnoDB in MySQL 5.6.26 and 5.7.10 is OK on this. MyRocks from FB MySQL is OK on this. The problem is that TokuDB uses type:index where other engines use type:range for the index access. Is something ignoring the LIMIT clause?

By accident the tests used values for tokudb_analyze_mode that are required to avoid bad query plans with sysbench. I might undo that for a few tests, but the index stats are good in this test. They don't drift to zero -- maybe that is a problem for update statements.

Slow query log output:
# Time: 2016-12-23T16:30:55.336869Z
# User@Host: root[root] @ localhost [127.0.0.1]  Id:  2524
# Schema: ib  Last_errno: 0  Killed: 0
# Query_time: 17.990052  Lock_time: 0.000068  Rows_sent: 10  Rows_examined: 41361128  Rows_affected: 0
# Bytes_sent: 540
SET timestamp=1482510655;
SELECT price,dateandtime,customerid FROM pi8 FORCE INDEX (pdc) WHERE (price>=319.30) ORDER BY price,dateandtime,customerid LIMIT 10;
# Time: 2016-12-23T16:30:56.928372Z
# User@Host: root[root] @ localhost [127.0.0.1]  Id:  2544
# Schema: ib  Last_errno: 0  Killed: 0
# Query_time: 29.310352  Lock_time: 0.000126  Rows_sent: 10  Rows_examined: 67533688  Rows_affected: 0
# Bytes_sent: 540
SET timestamp=1482510656;
SELECT price,dateandtime,customerid FROM pi9 FORCE INDEX (pdc) WHERE (price>=519.87) ORDER BY price,dateandtime,customerid LIMIT 10;
# Time: 2016-12-23T16:30:59.099488Z
# User@Host: root[root] @ localhost [127.0.0.1]  Id:  2533
# Schema: ib  Last_errno: 0  Killed: 0
# Query_time: 7.612549  Lock_time: 0.000059  Rows_sent: 10  Rows_examined: 18340078  Rows_affected: 0
# Bytes_sent: 546
SET timestamp=1482510659;
SELECT price,dateandtime,customerid FROM pi12 FORCE INDEX (pdc) WHERE (price>=143.04) ORDER BY price,dateandtime,customerid LIMIT 10;
# Time: 2016-12-23T16:31:01.702954Z
# User@Host: root[root] @ localhost [127.0.0.1]  Id:  2555
# Schema: ib  Last_errno: 0  Killed: 0
# Query_time: 17.916199  Lock_time: 0.000052  Rows_sent: 10  Rows_examined: 42871166  Rows_affected: 0
# Bytes_sent: 540
SET timestamp=1482510661;
SELECT price,dateandtime,customerid FROM pi2 FORCE INDEX (pdc) WHERE (price>=330.87) ORDER BY price,dateandtime,customerid LIMIT 10;

Explain of a problem query for TokuDB

mysql> explain SELECT price,dateandtime,customerid FROM pi5 FORCE INDEX (pdc) WHERE (price>=389.23) ORDER BY price,dateandtime,customerid LIMIT 10\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: pi5
   partitions: NULL
         type: index
possible_keys: pdc
          key: pdc
      key_len: 14
          ref: NULL
         rows: 10
     filtered: 100.00
        Extra: Using where; Using index
1 row in set, 1 warning (0.00 sec)

Explain of that problem query for InnoDB 5.7.10

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: pi5
   partitions: NULL
         type: range
possible_keys: pdc
          key: pdc
      key_len: 4
          ref: NULL
         rows: 64877752
     filtered: 100.00
        Extra: Using where; Using index
1 row in set, 1 warning (0.00 sec)

Table for problem query:
mysql> show create table pi5\G
*************************** 1. row ***************************
       Table: pi5
Create Table: CREATE TABLE `pi5` (
  `transactionid` int(11) NOT NULL AUTO_INCREMENT,
  `dateandtime` datetime DEFAULT NULL,
  `cashregisterid` int(11) NOT NULL,
  `customerid` int(11) NOT NULL,
  `productid` int(11) NOT NULL,
  `price` float NOT NULL,
  `data` varchar(4000) DEFAULT NULL,
  PRIMARY KEY (`transactionid`),
  KEY `marketsegment` (`price`,`customerid`),
  KEY `registersegment` (`cashregisterid`,`price`,`customerid`),
  KEY `pdc` (`price`,`dateandtime`,`customerid`)
) ENGINE=TokuDB AUTO_INCREMENT=130575101 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

Index stats:
mysql> show indexes from pi5;
+-------+------------+-----------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name        | Seq_in_index | Column_name    | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------+------------+-----------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| pi5   |          0 | PRIMARY         |            1 | transactionid  | A         |   130577696 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | marketsegment   |            1 | price          | A         |      132298 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | marketsegment   |            2 | customerid     | A         |    65288848 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | registersegment |            1 | cashregisterid | A         |        1086 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | registersegment |            2 | price          | A         |   130577696 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | registersegment |            3 | customerid     | A         |   130577696 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | pdc             |            1 | price          | A         |      132971 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | pdc             |            2 | dateandtime    | A         |   130577696 |     NULL | NULL   | YES  | BTREE      |         |               |
| pi5   |          1 | pdc             |            3 | customerid     | A         |   130577696 |     NULL | NULL   |      | BTREE      |         |               |
+-------+------------+-----------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+

mysql> show table status like "pi5"\G
*************************** 1. row ***************************
           Name: pi5
         Engine: TokuDB
        Version: 10
     Row_format: tokudb_zlib
           Rows: 130580100
 Avg_row_length: 43
    Data_length: 5614946640
Max_data_length: 9223372036854775807
   Index_length: 14448534548
      Data_free: 49168380
 Auto_increment: 130580101
    Create_time: 2016-12-22 20:30:16
    Update_time: 2016-12-23 08:39:41
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.00 sec)

/etc/my.cnf

[mysqld]
sql_mode=no_engine_substitution,no_unsigned_subtraction
default_tmp_storage_engine=MyISAM
tmpdir=/data/mysql/toku
query_cache_size=0
query_cache_type=0
slow_query_log_file=/data/mysql/toku/slow.log
long_query_time=5
max_binlog_size=128M
sync_binlog=0
server_id=77
datadir=/data/mysql/toku/data

tokudb_cache_size               = 10G
tokudb_directio                 = OFF
tokudb_empty_scan               = disabled
tokudb_read_block_size          = 16K
tokudb_commit_sync              = OFF
tokudb_checkpointing_period     = 300
tokudb_block_size               = 4M
#  tokudb_block_size               = 1M
tokudb_cleaner_iterations       = 10000
tokudb_enable_partial_eviction  = OFF
#   tokudb_fanout                   = 32
#   tokudb_checkpoint_pool_threads  = 2
tokudb_row_format               = tokudb_zlib
#tokudb_row_format               = tokudb_snappy
#tokudb_row_format               = tokudb_uncompressed
tokudb_fsync_log_period         = 1000

#default-storage-engine          = tokudb
#skip-innodb

default-tmp-storage-engine=MyISAM

#skip_log_bin
log_bin=/binlogs/myrocks/bl
binlog_format=row

tokudb_disable_prefetching = ON

tokudb_analyze_in_background = OFF
tokudb_analyze_time=10
#tokudb_analyze_mode=TOKUDB_ANALYZE_STANDARD
tokudb_analyze_mode=TOKUDB_ANALYZE_RECOUNT_ROWS
#tokudb_auto_analyze=0

slow_query_log = ON
long_query_time=1

#skip_log_bin
tokudb_cache_size               = 10G

[mysqld-safe]
malloc-lib=/usr/lib64/libjemalloc.so.1

MarkCallaghan

Dec 23, 2016, 12:09:22 PM
to Percona Discussion
I get a good query plan (type:range) after running ANALYZE TABLE. However, the index/table stats aren't different after the analyze. The workaround for me is to hack my benchmark scripts to add analyze commands.
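
A minimal sketch of that workaround, assuming the iibench tables are named pi1 through pi8 and using the same mysql invocation as the loop shown later in this thread (the real scripts may differ):

# run an explicit analyze on each benchmark table before starting a query phase
for x in $( seq 1 8 ); do mysql -uroot -ppw ib -e "ANALYZE TABLE pi$x"; done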


mysql> show indexes from pi5;
+-------+------------+-----------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name        | Seq_in_index | Column_name    | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------+------------+-----------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| pi5   |          0 | PRIMARY         |            1 | transactionid  | A         |   130631104 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | marketsegment   |            1 | price          | A         |      132352 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | marketsegment   |            2 | customerid     | A         |    65315552 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | registersegment |            1 | cashregisterid | A         |        1086 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | registersegment |            2 | price          | A         |   130631104 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | registersegment |            3 | customerid     | A         |   130631104 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | pdc             |            1 | price          | A         |      133026 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | pdc             |            2 | dateandtime    | A         |   130631104 |     NULL | NULL   | YES  | BTREE      |         |               |
| pi5   |          1 | pdc             |            3 | customerid     | A         |   130631104 |     NULL | NULL   |      | BTREE      |         |               |
+-------+------------+-----------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
9 rows in set (0.00 sec)

mysql> show table status like "pi5"\G
*************************** 1. row ***************************
           Name: pi5
         Engine: TokuDB
        Version: 10
     Row_format: tokudb_zlib
           Rows: 130631100
 Avg_row_length: 43
    Data_length: 5617139452
Max_data_length: 9223372036854775807
   Index_length: 6181893281
      Data_free: 43679740
 Auto_increment: 130631101
    Create_time: 2016-12-22 20:30:16
    Update_time: 2016-12-23 08:55:22
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.00 sec)

mysql> explain SELECT price,dateandtime,customerid FROM pi5 FORCE INDEX (pdc) WHERE (price>=389.23) ORDER BY price,dateandtime,customerid LIMIT 10\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: pi5
   partitions: NULL
         type: range
possible_keys: pdc
          key: pdc
      key_len: 4
          ref: NULL
         rows: 80702611

MarkCallaghan

Dec 24, 2016, 10:33:11 AM
to Percona Discussion
Tried running ANALYZE TABLE prior to the test phases that do queries. That briefly prevents the problem, but soon the query plans switch from type:range (good) to type:index (bad). I set tokudb_analyze_in_background=false and all of the TokuDB stats options were then...

tokudb_analyze_delete_fraction  1.000000
tokudb_analyze_in_background    OFF
tokudb_analyze_mode     TOKUDB_ANALYZE_STANDARD
tokudb_analyze_throttle 0
tokudb_analyze_time     5
tokudb_auto_analyze     30

At the end of the test I again ran ANALYZE TABLE for all tables and the plan was still bad (type:index). Then I restarted mysqld and after that the plan was good (type:range).

I also get intermittent errors from ANALYZE TABLE. There is no error message in the db error log. The most recent error occurred with this config:
tokudb_analyze_delete_fraction  1.000000
tokudb_analyze_in_background    ON
tokudb_analyze_mode     TOKUDB_ANALYZE_STANDARD
tokudb_analyze_throttle 0
tokudb_analyze_time     5
tokudb_auto_analyze     30

Here is one example of the failure; the table for which it fails changes over time:
# for x in $( seq 1 8 ); do ( mysql -uroot -ppw ib -e "analyze table pi$x" & ); done
Table   Op      Msg_type        Msg_text
ib.pi1  analyze status  OK
Table   Op      Msg_type        Msg_text
ib.pi2  analyze status  OK
Table   Op      Msg_type        Msg_text
ib.pi8  analyze status  Operation failed
Table   Op      Msg_type        Msg_text
ib.pi4  analyze status  OK
Table   Op      Msg_type        Msg_text
ib.pi5  analyze status  OK
Table   Op      Msg_type        Msg_text
Table   Op      Msg_type        Msg_text
ib.pi3  analyze status  OK
ib.pi7  analyze status  OK
Table   Op      Msg_type        Msg_text
ib.pi6  analyze status  OK

MarkCallaghan

Dec 26, 2016, 2:40:55 PM
to Percona Discussion
Despite my best efforts I can't make this work, where "this" is queries concurrent with inserts using my scripts -- https://github.com/mdcallag/mytools/blob/master/bench/ibench/iq.sh

I tried recollecting stats but that didn't work. What works, at least briefly, is to restart mysqld. By "works" I mean that the query plan uses type:range rather than type:index. But after ~30 minutes the query plan returns to full index scans (type:index) and only restarting mysqld fixes that.

George Lorch

Jan 2, 2017, 3:33:58 PM
to Percona Discussion
Hey Mark,
Sorry for the delayed response, I have been away for a few weeks and unable to research or respond.

So thanks for digging into this. This behavior is very curious. I cannot explain why the optimizer would change the query plan choice if the engine is reporting roughly the same row count and index statistics, and a restart of the server clears the issue. It seems there may be some other kind of transient state that the optimizer relies on, not readily visible via the CLI, that gets skewed after some period of running. Once I get all caught up on emails and other issues I will see if I can reproduce this 'bad plan choice' and identify the cause.

--
George O. Lorch III
Senior Software Engineer, Percona

George Lorch

Jan 3, 2017, 4:38:06 PM
to Percona Discussion
I am working on a reproducer for this, but one thing I noticed is your use of tokudb_auto_analyze. To totally disable automatic analysis, this must be set to 0. The default is 10, and in one of your tests above you have it set to 30. This value can be described as "at what percentage of table change (insert/update/delete) should an analyze be triggered". So as an example, if you set this to 30 and have 1000 rows, an analyze will be triggered once you have done some combination of 300 inserts/updates/deletes.

With the non-default value of tokudb_analyze_in_background=off that you are using, the analysis will take place within the context of the user thread that causes the threshold to be hit. Since you have tokudb_analyze_time=5 and 10 on different tests and a fairly large number of rows, the client/user threads that trigger the analyze will potentially take 5 or 10 seconds to do the analysis each time the threshold is hit.

This is specifically why we implemented the automatic analysis as a background task and set all defaults for it to operate that way. The only reason we allow it to analyze in the foreground is to continue to mimic the traditional way ANALYZE TABLE operates, for those who want to script explicit analysis invocation and know when it completes.

So this is all just a warning that overriding the defaults to use automatic, foreground analysis will skew your results in weird ways by periodically introducing the analyze time as part of an insert/update/delete response time. You should either leave it at the default tokudb_analyze_in_background=ON with your existing tokudb_auto_analyze=10|30 and tokudb_analyze_time=5|10, or just disable automatic analysis entirely with tokudb_auto_analyze=0 and stick with whatever index cardinality stats were collected the last time you analyzed. (Both alternatives are sketched below.)
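
For reference, a rough sketch of those two alternatives as my.cnf fragments, using the values already mentioned in this thread (illustrative only; pick one or the other):

# Option 1: keep automatic analysis, but let it run in the background (the default behavior)
tokudb_analyze_in_background    = ON
tokudb_auto_analyze             = 30
tokudb_analyze_time             = 5

# Option 2: disable automatic analysis entirely and rely on explicit ANALYZE TABLE
tokudb_auto_analyze             = 0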

Either way, I don't believe this has anything to do with the query plan type changing between index and range, since you say that the row counts remain stable and the cardinality stats also stay within the expected range -- unless the optimizer is somehow noticing this excessive time on some of the inserts and altering the query plan decision.