question about my.cnf options for TokuDB

MARK CALLAGHAN

May 9, 2016, 10:16:10 AM
to percona-d...@googlegroups.com
Based on a discussion with Vadim I have been using the following config for a few benchmarks -- linkbench and insert benchmark (insert-only and insert with queries). 

My results from TokuDB are frequently not good and I wonder whether the options are to blame. With one exception I have not experimented with changing these because there isn't much description for many of the variables at https://www.percona.com/doc/percona-server/5.6/tokudb/tokudb_variables.html

The one exception is that someone suggested I try 'tokudb_fanout=128'. I did, and it made linkbench much slower.

Are these options good enough? Are there other options I should set? I wonder if tokudb_checkpoint_pool_threads is large enough when I run tests on a many-core server.


tokudb_cache_size               = ...
tokudb_directio                 = OFF
tokudb_empty_scan               = disabled
tokudb_read_block_size          = 16K
tokudb_commit_sync              = OFF
tokudb_checkpointing_period     = 900
tokudb_block_size               = 4M
tokudb_cleaner_iterations       = 10000
tokudb_checkpoint_pool_threads  = 1
tokudb_enable_partial_eviction  = OFF
tokudb_row_format               = tokudb_zlib

--
Mark Callaghan
mdca...@gmail.com

George Lorch

May 9, 2016, 1:11:34 PM
to Percona Discussion
There are a lot of moving parts here, which makes it difficult to offer the 'right' answer for all cases, but I will try to give you a few general guidelines/ideas:
  1. If you have fast storage, both low latency and high throughput:
    1. You might want your tokudb_fanout to be larger than the default of 16. Vadim has seen good results with a fanout of 128; going much larger than that yields diminishing returns. Generally, increasing fanout reduces concurrency on that tree due to PerconaFT node locking issues. Even at a fanout of 16 I have seen a basic sysbench test hit serious concurrency issues with as few as 4 threads on a single table. If you increase the number of trees/tables, the concurrency issue becomes less visible. So adjust this upwards if you have a lot of trees, downwards if you have only a few trees but a lot of threads. Once we address the PerconaFT node locking issue, it will be more practical to increase the default fanout to some fairly high values without risking concurrency issues. I hope to have something available for this later in the year.
    2. You will want to use a smaller tokudb_block_size (node size). Dropping from the default of 4M to 1M helps, but it increases the likelihood that you will run into an issue with the PerconaFT map file block allocation strategy. We have a fix pending for this that should make smaller block sizes more efficient and thus allow a practical block size that is a small multiple of the read block size. For now, 1M seems a good tradeoff for fast storage.
  2. Checkpoint threads:
    1. This one is kind of tricky. It depends on the load you are applying and whether you have CPU cores going unused during your test. If you have cores going unused, then yes, you can safely increase the number of checkpoint threads, but this will only help up to a point. The checkpoint threads are where some (not all) of the PerconaFT block compression and writing occurs during a checkpoint. If you over-allocate these threads, they will bite into the CPU availability of the user/query processing threads. In most test cases we have seen that the default allocation is in fact too high. You can monitor the size of the queued work this thread pool has during a checkpoint by watching the TokuDB status variables mentioned at https://www.percona.com/doc/percona-server/5.6/tokudb/tokudb_troubleshooting.html#global-status (an example query is included at the end of this message), paying particular attention to:
      1. TOKUDB_CACHETABLE_POOL_CHECKPOINT_NUM_THREADS
      2. TOKUDB_CACHETABLE_POOL_CHECKPOINT_NUM_THREADS_ACTIVE
      3. TOKUDB_CACHETABLE_POOL_CHECKPOINT_QUEUE_SIZE
      4. TOKUDB_CACHETABLE_POOL_CHECKPOINT_MAX_QUEUE_SIZE
      5. TOKUDB_CACHETABLE_POOL_CHECKPOINT_TOTAL_ITEMS_PROCESSED
      6. TOKUDB_CACHETABLE_POOL_CHECKPOINT_TOTAL_EXECUTION_TIME
    2. Basically, each 'item' in this queue is a PerconaFT node that has been cloned in memory and sent to this pool to be compressed, serialized and written to disk. The side effect of this is that these clones count against the cachetable memory limit and their existence, even for a short period of time, can force the cachetable evictor to begin aggressively clearing nodes out of the cachetable memory in order to stay below the memory cap.
    3. So right now it is a balancing act to ensure you have enough checkpoint threads to prevent runaway memory utilization during checkpoint, but not so many that they take over all of your CPU and starve user request handling threads.
  3. Checkpointing period:
    1. This is another tricky one. The current default period of 1 minute is pretty short for most systems and doesn't collect enough changes per checkpoint. In theory, if you never want to 'recover', you can go with an infinite checkpoint period, but your shutdown checkpoint and crash recovery will also take infinitely long :) In most of our benchmarks, both big and small, we find value in increasing this up to at least 5-15 minutes, but the trade-off is that when the checkpoint finally does run, it causes what we have come to call a 'clone storm' and usually shows up as a brief instability in performance during the first few seconds of checkpointing, causing user response times to skyrocket and throughput to drop to the floor. We are working on a few different angles of fixing checkpointing so this issue isn't quite as visible and acute.
Hope this helps some. I know it isn't a simple "set this to that and you're good" answer, but unfortunately, this is where TokuDB/PerconaFT is today. It was all originally designed and tuned for high-latency, moderate-throughput storage, so many of the ideas embedded within it start to fall apart once your storage capability starts catching up with your memory capability.
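
For reference, here is a hedged example of how one might watch those checkpoint pool counters from the mysql client during a checkpoint (this assumes the status variable names listed above; exact names and capitalization may vary by server version):

mysql> SHOW GLOBAL STATUS LIKE 'Tokudb_cachetable_pool_checkpoint%';

Comparing TOKUDB_CACHETABLE_POOL_CHECKPOINT_QUEUE_SIZE against its MAX counterpart while a checkpoint is running gives a rough sense of whether the pool is keeping up.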

MARK CALLAGHAN

May 9, 2016, 1:22:35 PM
to percona-d...@googlegroups.com
Thanks for the detailed response. I will run a test with more checkpoint threads to see if that helps. Using tokudb_fanout=128 for my tests produced much worse results in the recent past.

RocksDB has the concept of fanout but I am not sure it is similar to the TokuDB concept of fanout. In RocksDB, were I to increase fanout from 16 to 128, I would increase write-amp by 8X, and that is not desired. Does increasing fanout for TokuDB have the same side effect of increasing write-amp?


--
Mark Callaghan
mdca...@gmail.com

George O. Lorch III

May 9, 2016, 3:31:01 PM
to percona-d...@googlegroups.com
I believe in theory, yes, it would increase write-amp, and for roughly the same reason. The longer checkpointing period, though (checkpoints are when nodes are _actually_ written to disk), helps re-fold this back upon itself and reduce the observable amplification. It again really depends on how the workload is spread across nodes in the tree and how many node writes can be collected into a single checkpoint period. Increasing the fanout is more of a read optimization than a write optimization.

-- 
George O. Lorch III
Software Engineer, Percona
US/Arizona (GMT -7)
skype: george.ormond.lorch.iii

MARK CALLAGHAN

May 11, 2016, 1:42:38 PM
to percona-d...@googlegroups.com
Using my iibench scripts with inserts & queries enabled, changing from the default (tokudb_fanout=16) to tokudb_fanout=32 doubled write-amplification, as predicted, and decreased write throughput. This was expected and is not something I want for this workload.

Changing checkpoint threads from 1 (the default) to 2 increased the peak insert rate by ~8% without any bad side effects. I will continue to use 2 checkpoint threads for future tests.
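
For reference, the my.cnf delta being described would look roughly like this (a sketch; the rest of the config posted earlier in the thread is unchanged):

# was 1, the default
tokudb_checkpoint_pool_threads  = 2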

Vadim Tkachenko

May 11, 2016, 6:20:00 PM
to percona-discussion
Mark,

With checkpoint threads, the point to consider is not only the peak rate but also the variance in throughput during a checkpoint. The more threads you allocate for checkpointing, the more resources are taken from user threads and the bigger the performance drop you may observe. But on the other hand the drop will be shorter, as the checkpoint will finish faster.

Vadim Tkachenko, CTO, Percona
Phone +1-919-249-7483, Skype: vadimtk153

MARK CALLAGHAN

May 11, 2016, 6:43:55 PM
to percona-d...@googlegroups.com
I don't see more variance from this for the insert-only insert benchmark. When I add queries concurrent with rate-limited inserts there is another problem that is so significant that not much else matters. The problem is the per-node locking done when the Toku equivalent of compaction is in progress. I am not sure whether it is a mutex or a RW-lock, but that doesn't matter much. The end result is that both insert and query performance are much worse than I expect. I have been using tokudb_block_size=4M and will repeat a few tests with something smaller.

MARK CALLAGHAN

May 24, 2016, 1:37:20 PM
to percona-d...@googlegroups.com
tokudb_block_size=1M helped for part of the test but hurt other parts. I tried changing tokudb_cleaner_iterations to a smaller value than 10,000 but that made things worse.

The problem I have on the insert benchmark when there are inserts and queries at the same time is that the QPS for queries drops to zero. This reproduces on servers with fast SSD and with a small disk array.
--
Mark Callaghan
mdca...@gmail.com

George Lorch

May 24, 2016, 1:48:18 PM
to Percona Discussion
If it is periodic, it is likely the checkpointing issues in FT; otherwise it could be the internal FT node-locking contention issue. With a single write lock held high in a tree you effectively block all access to everything from that particular node downwards. The FT node locking is entirely too coarse when it comes to write locks, and the huge node sizes (compared to InnoDB) exacerbate the problem, since a single write lock now locks a much more significant portion of the data set.

We've got a few things coming down the pike to address the checkpointing issues and the node lock granularity, among other things, but they are all pretty invasive changes, so proving them out and testing them will take quite some time before they see the real world.


MarkCallaghan

Dec 23, 2016, 12:04:45 PM
to Percona Discussion
I am not asking you to fix this. I am sharing it here so you don't first learn of this when I share results in public.

Upgraded to TokuDB from Percona Server 5.7.15, running the insert benchmark using https://github.com/mdcallag/mytools/blob/master/bench/ibench/iq.sh. There are 3 phases in this test: insert-only, inserts rate-limited to 1000/second with concurrent queries, and inserts rate-limited to 100/second with concurrent queries. The query rates drop to 0 per second during the second and third phases.

This isn't bad query plans from bad index cardinality stats. I checked the results from SHOW INDEXES and SHOW TABLE STATUS and the numbers are good. The QPS drops because the queries do full (or long) index scans when they should be doing short index-only scans on covering secondary indexes. InnoDB in MySQL 5.6.26 and 5.7.10 is OK on this. MyRocks from FB MySQL is OK on this. The problem is that TokuDB uses type:index where other engines use type:range for the index access. Is something ignoring the LIMIT clause?

By accident the tests used values for tokudb_analyze_mode that are required to avoid bad query plans with sysbench. I might undo that for a few tests, but the index stats are good in this test. They don't drift to zero -- maybe that is a problem for update statements.

Slow query log output:
# Time: 2016-12-23T16:30:55.336869Z
# User@Host: root[root] @ localhost [127.0.0.1]  Id:  2524
# Schema: ib  Last_errno: 0  Killed: 0
# Query_time: 17.990052  Lock_time: 0.000068  Rows_sent: 10  Rows_examined: 41361128  Rows_affected: 0
# Bytes_sent: 540
SET timestamp=1482510655;
SELECT price,dateandtime,customerid FROM pi8 FORCE INDEX (pdc) WHERE (price>=319.30) ORDER BY price,dateandtime,customerid LIMIT 10;
# Time: 2016-12-23T16:30:56.928372Z
# User@Host: root[root] @ localhost [127.0.0.1]  Id:  2544
# Schema: ib  Last_errno: 0  Killed: 0
# Query_time: 29.310352  Lock_time: 0.000126  Rows_sent: 10  Rows_examined: 67533688  Rows_affected: 0
# Bytes_sent: 540
SET timestamp=1482510656;
SELECT price,dateandtime,customerid FROM pi9 FORCE INDEX (pdc) WHERE (price>=519.87) ORDER BY price,dateandtime,customerid LIMIT 10;
# Time: 2016-12-23T16:30:59.099488Z
# User@Host: root[root] @ localhost [127.0.0.1]  Id:  2533
# Schema: ib  Last_errno: 0  Killed: 0
# Query_time: 7.612549  Lock_time: 0.000059  Rows_sent: 10  Rows_examined: 18340078  Rows_affected: 0
# Bytes_sent: 546
SET timestamp=1482510659;
SELECT price,dateandtime,customerid FROM pi12 FORCE INDEX (pdc) WHERE (price>=143.04) ORDER BY price,dateandtime,customerid LIMIT 10;
# Time: 2016-12-23T16:31:01.702954Z
# User@Host: root[root] @ localhost [127.0.0.1]  Id:  2555
# Schema: ib  Last_errno: 0  Killed: 0
# Query_time: 17.916199  Lock_time: 0.000052  Rows_sent: 10  Rows_examined: 42871166  Rows_affected: 0
# Bytes_sent: 540
SET timestamp=1482510661;
SELECT price,dateandtime,customerid FROM pi2 FORCE INDEX (pdc) WHERE (price>=330.87) ORDER BY price,dateandtime,customerid LIMIT 10;

Explain of a problem query for TokuDB

mysql> explain SELECT price,dateandtime,customerid FROM pi5 FORCE INDEX (pdc) WHERE (price>=389.23) ORDER BY price,dateandtime,customerid LIMIT 10\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: pi5
   partitions: NULL
         type: index
possible_keys: pdc
          key: pdc
      key_len: 14
          ref: NULL
         rows: 10
     filtered: 100.00
        Extra: Using where; Using index
1 row in set, 1 warning (0.00 sec)

Explain of that problem query for InnoDB 5.7.10

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: pi5
   partitions: NULL
         type: range
possible_keys: pdc
          key: pdc
      key_len: 4
          ref: NULL
         rows: 64877752
     filtered: 100.00
        Extra: Using where; Using index
1 row in set, 1 warning (0.00 sec)

Table for problem query:
mysql> show create table pi5\G
*************************** 1. row ***************************
       Table: pi5
Create Table: CREATE TABLE `pi5` (
  `transactionid` int(11) NOT NULL AUTO_INCREMENT,
  `dateandtime` datetime DEFAULT NULL,
  `cashregisterid` int(11) NOT NULL,
  `customerid` int(11) NOT NULL,
  `productid` int(11) NOT NULL,
  `price` float NOT NULL,
  `data` varchar(4000) DEFAULT NULL,
  PRIMARY KEY (`transactionid`),
  KEY `marketsegment` (`price`,`customerid`),
  KEY `registersegment` (`cashregisterid`,`price`,`customerid`),
  KEY `pdc` (`price`,`dateandtime`,`customerid`)
) ENGINE=TokuDB AUTO_INCREMENT=130575101 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

Index stats:
mysql> show indexes from pi5;
+-------+------------+-----------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name        | Seq_in_index | Column_name    | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------+------------+-----------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| pi5   |          0 | PRIMARY         |            1 | transactionid  | A         |   130577696 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | marketsegment   |            1 | price          | A         |      132298 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | marketsegment   |            2 | customerid     | A         |    65288848 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | registersegment |            1 | cashregisterid | A         |        1086 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | registersegment |            2 | price          | A         |   130577696 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | registersegment |            3 | customerid     | A         |   130577696 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | pdc             |            1 | price          | A         |      132971 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | pdc             |            2 | dateandtime    | A         |   130577696 |     NULL | NULL   | YES  | BTREE      |         |               |
| pi5   |          1 | pdc             |            3 | customerid     | A         |   130577696 |     NULL | NULL   |      | BTREE      |         |               |
+-------+------------+-----------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+

mysql> show table status like "pi5"\G
*************************** 1. row ***************************
           Name: pi5
         Engine: TokuDB
        Version: 10
     Row_format: tokudb_zlib
           Rows: 130580100
 Avg_row_length: 43
    Data_length: 5614946640
Max_data_length: 9223372036854775807
   Index_length: 14448534548
      Data_free: 49168380
 Auto_increment: 130580101
    Create_time: 2016-12-22 20:30:16
    Update_time: 2016-12-23 08:39:41
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.00 sec)

/etc/my.cnf

[mysqld]
sql_mode=no_engine_substitution,no_unsigned_subtraction
default_tmp_storage_engine=MyISAM
tmpdir=/data/mysql/toku
query_cache_size=0
query_cache_type=0
slow_query_log_file=/data/mysql/toku/slow.log
long_query_time=5
max_binlog_size=128M
sync_binlog=0
server_id=77
datadir=/data/mysql/toku/data

tokudb_cache_size               = 10G
tokudb_directio                 = OFF
tokudb_empty_scan               = disabled
tokudb_read_block_size          = 16K
tokudb_commit_sync              = OFF
tokudb_checkpointing_period     = 300
tokudb_block_size               = 4M
#  tokudb_block_size               = 1M
tokudb_cleaner_iterations       = 10000
tokudb_enable_partial_eviction  = OFF
#   tokudb_fanout                   = 32
#   tokudb_checkpoint_pool_threads  = 2
tokudb_row_format               = tokudb_zlib
#tokudb_row_format               = tokudb_snappy
#tokudb_row_format               = tokudb_uncompressed
tokudb_fsync_log_period         = 1000

#default-storage-engine          = tokudb
#skip-innodb

default-tmp-storage-engine=MyISAM

#skip_log_bin
log_bin=/binlogs/myrocks/bl
binlog_format=row

tokudb_disable_prefetching = ON

tokudb_analyze_in_background = OFF
tokudb_analyze_time=10
#tokudb_analyze_mode=TOKUDB_ANALYZE_STANDARD
tokudb_analyze_mode=TOKUDB_ANALYZE_RECOUNT_ROWS
#tokudb_auto_analyze=0

slow_query_log = ON
long_query_time=1

#skip_log_bin
tokudb_cache_size               = 10G

[mysqld-safe]
malloc-lib=/usr/lib64/libjemalloc.so.1

MarkCallaghan

Dec 23, 2016, 12:09:22 PM
to Percona Discussion
I get a good query plan (type:range) after running ANALYZE TABLE. However, the index/table stats aren't different after the analyze. The workaround for me is to hack my benchmark scripts to add analyze commands.
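
A minimal sketch of that workaround, assuming the iibench tables are named pi1 through pi8 and using the same mysql invocation as the loop shown later in this thread (the real scripts may differ):

# run an explicit analyze on each benchmark table before starting a query phase
for x in $( seq 1 8 ); do mysql -uroot -ppw ib -e "ANALYZE TABLE pi$x"; done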


mysql> show indexes from pi5;
+-------+------------+-----------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name        | Seq_in_index | Column_name    | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------+------------+-----------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| pi5   |          0 | PRIMARY         |            1 | transactionid  | A         |   130631104 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | marketsegment   |            1 | price          | A         |      132352 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | marketsegment   |            2 | customerid     | A         |    65315552 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | registersegment |            1 | cashregisterid | A         |        1086 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | registersegment |            2 | price          | A         |   130631104 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | registersegment |            3 | customerid     | A         |   130631104 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | pdc             |            1 | price          | A         |      133026 |     NULL | NULL   |      | BTREE      |         |               |
| pi5   |          1 | pdc             |            2 | dateandtime    | A         |   130631104 |     NULL | NULL   | YES  | BTREE      |         |               |
| pi5   |          1 | pdc             |            3 | customerid     | A         |   130631104 |     NULL | NULL   |      | BTREE      |         |               |
+-------+------------+-----------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
9 rows in set (0.00 sec)

mysql> show table status like "pi5"\G
*************************** 1. row ***************************
           Name: pi5
         Engine: TokuDB
        Version: 10
     Row_format: tokudb_zlib
           Rows: 130631100
 Avg_row_length: 43
    Data_length: 5617139452
Max_data_length: 9223372036854775807
   Index_length: 6181893281
      Data_free: 43679740
 Auto_increment: 130631101
    Create_time: 2016-12-22 20:30:16
    Update_time: 2016-12-23 08:55:22
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.00 sec)

mysql> explain SELECT price,dateandtime,customerid FROM pi5 FORCE INDEX (pdc) WHERE (price>=389.23) ORDER BY price,dateandtime,customerid LIMIT 10\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: pi5
   partitions: NULL
         type: range
possible_keys: pdc
          key: pdc
      key_len: 4
          ref: NULL
         rows: 80702611

MarkCallaghan

Dec 24, 2016, 10:33:11 AM
to Percona Discussion
Tried running ANALYZE TABLE prior to the test phases that do queries. That briefly prevents the problem, but soon the query plans switch from type:range (good) to type:index (bad). I set tokudb_analyze_in_background=false and all of the TokuDB stats options were then...

tokudb_analyze_delete_fraction  1.000000
tokudb_analyze_in_background    OFF
tokudb_analyze_mode     TOKUDB_ANALYZE_STANDARD
tokudb_analyze_throttle 0
tokudb_analyze_time     5
tokudb_auto_analyze     30

At the end of the test I again ran ANALYZE TABLE for all tables and the plan was still bad (type:index). Then I restarted mysqld and after that the plan was good (type:range).

I also get intermittent errors from ANALYZE TABLE. There is no error message in the db error log. The most recent error occurred with this config:
tokudb_analyze_delete_fraction  1.000000
tokudb_analyze_in_background    ON
tokudb_analyze_mode     TOKUDB_ANALYZE_STANDARD
tokudb_analyze_throttle 0
tokudb_analyze_time     5
tokudb_auto_analyze     30

Here is one example of the failure; the table for which it fails changes over time:
# for x in $( seq 1 8 ); do ( mysql -uroot -ppw ib -e "analyze table pi$x" & ); done
Table   Op      Msg_type        Msg_text
ib.pi1  analyze status  OK
Table   Op      Msg_type        Msg_text
ib.pi2  analyze status  OK
Table   Op      Msg_type        Msg_text
ib.pi8  analyze status  Operation failed
Table   Op      Msg_type        Msg_text
ib.pi4  analyze status  OK
Table   Op      Msg_type        Msg_text
ib.pi5  analyze status  OK
Table   Op      Msg_type        Msg_text
Table   Op      Msg_type        Msg_text
ib.pi3  analyze status  OK
ib.pi7  analyze status  OK
Table   Op      Msg_type        Msg_text
ib.pi6  analyze status  OK

MarkCallaghan

Dec 26, 2016, 2:40:55 PM
to Percona Discussion
Despite my best efforts I can't make this work, where "this" is queries concurrent with inserts using my scripts -- https://github.com/mdcallag/mytools/blob/master/bench/ibench/iq.sh

I tried recollecting stats but that didn't work. What works, at least briefly, is to restart mysqld. By "works" I mean that the query plan uses type:range rather than type:index. But after ~30 minutes the query plan returns to full index scans (type:index) and only restarting mysqld fixes that.

George Lorch

Jan 2, 2017, 3:33:58 PM
to Percona Discussion
Hey Mark,
Sorry for the delayed response, I have been away for a few weeks and unable to research or respond.

So thanks for digging into this. This behavior is very curious. I cannot explain why the optimizer would change the query plan choice if the engine is reporting roughly the same row count and index statistics, and a restart of the server clears the issue. It seems there may be some other kind of transient state that the optimizer relies on, not readily visible via the CLI, that gets skewed after some period of running. Once I get all caught up on emails and other issues I will see if I can reproduce this 'bad plan choice' and identify the cause.

--
George O. Lorch III
Senior Software Engineer, Percona

George Lorch

Jan 3, 2017, 4:38:06 PM
to Percona Discussion
I am working on a reproducer for this, but one thing I noticed is your use of tokudb_auto_analyze. To totally disable automatic analysis, this must be set to 0. The default is 10, and in one of your tests above you have it set to 30. This value can be described as "at what percentage of table change (insert/update/delete) should an analyze be triggered". So as an example, if you set this to 30 and have 1000 rows, an analyze will be triggered once you have done some combination of 300 inserts/updates/deletes.

With the non-default value of tokudb_analyze_in_background=off that you are using, the analysis will take place within the context of the user thread that causes the threshold to be hit. Since you have tokudb_analyze_time=5 and 10 on different tests and a fairly large number of rows, the client/user threads that trigger the analyze will potentially take 5 or 10 seconds to do the analysis each time the threshold is hit.

This is specifically why we implemented the automatic analysis as a background task and set all defaults for it to operate that way. The only reason we allow it to analyze in the foreground is to continue to mimic the traditional way ANALYZE TABLE operates, for those who want to script explicit analysis invocation and know when it completes.

So this is all just a warning that overriding the defaults to use automatic, foreground analysis will skew your results in weird ways by periodically introducing the analyze time as part of an insert/update/delete response time. You should either leave it at the default tokudb_analyze_in_background=ON with your existing tokudb_auto_analyze=10|30 and tokudb_analyze_time=5|10, or just disable automatic analysis entirely with tokudb_auto_analyze=0 and stick with whatever index cardinality stats were collected the last time you analyzed. (Both alternatives are sketched below.)
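
For reference, a rough sketch of those two alternatives as my.cnf fragments, using the values already mentioned in this thread (illustrative only; pick one or the other):

# Option 1: keep automatic analysis, but let it run in the background (the default behavior)
tokudb_analyze_in_background    = ON
tokudb_auto_analyze             = 30
tokudb_analyze_time             = 5

# Option 2: disable automatic analysis entirely and rely on explicit ANALYZE TABLE
tokudb_auto_analyze             = 0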

Either way, I don't believe this has anything to do with the query plan type changing between index and range, since you say that the row counts remain stable and the cardinality stats also stay within the expected range -- unless the optimizer is somehow noticing this excessive time on some of the inserts and altering the query plan decision.