index cardinality and "Rows" drop to 0 with TokuDB in Percona 5.7.15-9


MarkCallaghan

Nov 30, 2016, 23:33:10
to Percona Discussion
I am using Percona 5.7.15-9. How do I disable auto stats collection? I am willing to collect stats once and then run a few performance tests, and I would be happy with optimizer stats that never change as inserts, updates and deletes are done. I have yet to figure out how to configure TokuDB to do that. For the tests I have been running, in some cases the optimizer stats (cardinality in SHOW INDEXES and Rows in SHOW TABLE STATUS) drop to zero, and that causes lousy query plans.

I am confused by some of https://www.percona.com/doc/percona-server/5.7/tokudb/tokudb_background_analyze_table.html. I have yet to find the my.cnf options that make this claim true:
"If an analyze is never performed on a table then the cardinality is 1 for primary key indexes and unique secondary indexes, and NULL (unknown) for all other indexes. Proper cardinality can lead to improved performance of complex SQL statements."

My test workload is sysbench. The base table has 8M rows and the test pattern is: load, create a secondary index, then run more sysbench tests. Immediately after the load I see cardinality=8M for the PK. After the secondary index has been created I see cardinality=8M for the PK and the secondary index. Then the sysbench tests start; the first two are update-only workloads, and over time the cardinality drops to zero. The drop is a function of the update rate, so with too little concurrency it can take a long time to see the problem. The 'Rows' value in SHOW TABLE STATUS also drops to zero. See https://gist.github.com/mdcallag/bd47d712802a281b44a5b939ff697cff
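
To watch the decay during a run, something like this is enough (test.sbtest1 is the sysbench table and k_1 its secondary index):
    show indexes from test.sbtest1;              -- Cardinality for the PK and k_1
    show table status from test like 'sbtest1';  -- Rows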

The workload is update-only when the problem occurs. Running ANALYZE TABLE after the problem appears fixes it, assuming this is done first: set @@tokudb_analyze_mode=TOKUDB_ANALYZE_RECOUNT_ROWS;

Simple range queries get lousy plans when Rows=0 and cardinality=0. That turns them into full index scans. For example here is a good plan before the stats become zero: https://gist.github.com/mdcallag/e2a0dee51f8470ac74ce5e85670eb175 and here is a bad plan after they become zero: https://gist.github.com/mdcallag/e1b78a52543f9ff1841ffadf0137f67f
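
To see the plan flip without opening the gists, a range query with the same shape as the sysbench ones can be explained directly (the query text here is illustrative, not the exact one from my test):
    explain select c from sbtest1 where id between 1000 and 2000;
    -- before: type=range with a small rows estimate; after the stats hit zero: type=index, i.e. a full index scan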

I am using RPMs like Percona-Server-server-57-5.7.15-9.1.el6.x86_64.rpm and running sysbench via https://github.com/mdcallag/mytools/blob/master/bench/sysbench.lua/all.sh
The command line for my script that runs sysbench is: bash all.sh 1 80000000 180 300 tokudb 1 0 mysql

my.cnf is:
[mysqld]
sql_mode=no_engine_substitution,no_unsigned_subtraction
default_tmp_storage_engine=MyISAM
tmpdir=/data/mysql/toku
query_cache_size=0
query_cache_type=0
slow_query_log_file=/data/mysql/toku/slow.log
long_query_time=5
max_binlog_size=128M
sync_binlog=0
server_id=77
datadir=/data/mysql/toku/data

tokudb_cache_size               = 2G
tokudb_directio                 = OFF
tokudb_empty_scan               = disabled
tokudb_read_block_size          = 16K
tokudb_commit_sync              = OFF
tokudb_checkpointing_period     = 300
tokudb_block_size               = 4M
tokudb_cleaner_iterations       = 10000
tokudb_enable_partial_eviction  = OFF
tokudb_row_format               = tokudb_uncompressed
tokudb_fsync_log_period         = 1000

log_bin=/binlogs/myrocks/bl
binlog_format=row

tokudb_disable_prefetching = ON
slow_query_log = ON
long_query_time=1

[mysqld_safe]
malloc-lib=/usr/lib64/libjemalloc.so.1

----

Background analyze is enabled by default, so it was on. But I am not sure which of these is the problem.

tokudb_analyze_delete_fraction  1.000000
tokudb_analyze_in_background    ON
tokudb_analyze_mode     TOKUDB_ANALYZE_STANDARD
tokudb_analyze_throttle 0
tokudb_analyze_time     5
tokudb_auto_analyze     30
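
(For reference, the list above comes from a query like:)
    show global variables like 'tokudb_analyze%';
    show global variables like 'tokudb_auto_analyze';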

MarkCallaghan

Dec 1, 2016, 10:12:27
to Percona Discussion
Is it possible to disable the automatic management of optimizer stats (index cardinality and #rows)? I repeated my test with tokudb_auto_analyze=0 and the problem still occurs. My tests would work if I could collect stats after a table load and then leave them unchanged for the test.

Then, after the problem occurred, I ran 'analyze table sbtest1' with tokudb_analyze_mode=TOKUDB_ANALYZE_STANDARD. The analyze finished quickly as expected, but the optimizer stats were still zero at the end. If this problem can't be avoided, I want a way to quickly fix the stats. This fixes them, but it takes too long for me to run it frequently:
    set session tokudb_analyze_mode=TOKUDB_ANALYZE_RECOUNT_ROWS;
    analyze table sbtest1;
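
(tokudb_analyze_mode is a session variable, so a variant that restores the default afterwards in the same session would be:)
    set session tokudb_analyze_mode=TOKUDB_ANALYZE_RECOUNT_ROWS;
    analyze table sbtest1;
    set session tokudb_analyze_mode=TOKUDB_ANALYZE_STANDARD;  -- restore the default shown earlier in the thread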

MarkCallaghan

Dec 1, 2016, 11:28:24
to Percona Discussion
I have 4 ways to run the test. For 2 of them all data fits in the TokuDB block cache; for the other 2 maybe 10% of the data is in the TokuDB block cache. It looks like this problem only occurs when the data does not all fit in the block cache.

For all data in the block cache I set tokudb_cache_size=100G and use these command lines:
# 1 table, 8M rows
bash all.sh 1 8000000 180 300 tokudb 1 0 mysql 
# 8 tables, 1M rows/table
bash all.sh 8 1000000 180 300 tokudb 1 0 mysql

For tests that don't cache all data in TokuDB I set tokudb_cache_size=2G and use these command lines:
# 1 table, 80M rows
bash all.sh 1 80000000 180 300 tokudb 1 0 mysql
# 8 tables, 10M rows/table
bash all.sh 8 10000000 180 300 tokudb 1 0 mysql

George Lorch

Dec 1, 2016, 13:30:07
to Percona Discussion
Thanks for the testing on this Mark.

First off, let me cover the analysis angle. The auto analyze for cardinality does not impact the row count. The FT library tries to maintain a logical running row count, and that can be quite fuzzy as I will explain a bit later.

The automatic background analysis only recalculates cardinality statistics for the range of rows that it is permitted, time-wise, to cover, and assumes that the distribution is roughly the same across the data range. Internally, the server expects us to report back to it the number of records per key and key part. I see where the confusion is on how to disable it, as the docs do not mention it. I do remember specifying how to disable it, but that must have gotten lost along the way. I am going to go back, figure out how this happened, and ensure the docs are updated. It is actually quite simple: in your my.cnf, set tokudb_auto_analyze=0.
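
That is, in my.cnf:

[mysqld]
tokudb_auto_analyze = 0

It should also be dynamic, so SET GLOBAL tokudb_auto_analyze = 0; ought to take effect without a restart.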

As you can see, once a reported row count hits 0, cardinality information is useless: the optimizer seems to just ignore it and revert to table scans. Disabling the auto analysis will have no effect on the inaccurate row counts.



Now on to what I see as your real issue. Back in PS 5.6.32-78.0 and 5.7.14-7 we fixed https://tokutek.atlassian.net/browse/FT-732 and https://tokutek.atlassian.net/browse/DB-1006. This issue matches your description precisely, which is somewhat concerning, as it indicates that we missed something and did not uncover it in testing.

The issue comes down to deciding how to adjust a running logical row count (LRC) when you have a data structure that supports blind and deferred operations (messages): for example, a delete that might not have a target record, an update that turns into an insert as the result of a missing record, or an insert that turns into an update due to a record collision. All of these are real things that occur within the Fractal Tree, and their ultimate effect on the LRC cannot be pre-determined. Only when the messages are physically and permanently delivered to the leaves can the impact be known.

So what we have in the FT is a 'guess and correct' system. It guesses how to adjust the LRC based on the message type when the message is dropped into the top of the tree, then corrects the LRC when the message reaches the leaf entry (row) and is applied. One interesting and pertinent data point is that FT treats all updates as inserts. So an update at the top of the tree is converted to an insert message and as such immediately increments the LRC. When that insert message is applied to an existing record, it is recognized as an update and therefore 'corrects' the LRC by decrementing it.

This sounds simple enough until you add in searches/scans/queries. In order for a query to return the transactional truth, all of the pending messages above the target leaf entries must be pulled down and applied. This results in the LRC being adjusted by the 'correction' logic, meaning that inserts (which are actually updates) correct the LRC downward to counter the upward bump it got when the message was originally pushed into the tree. For these queries, this activity is NOT persistent: the messages remain in the tree and the leaf node is not made dirty. Only when messages are pushed downwards via the message flusher, a gorged node, node splits/merges, or other write operations do things get persisted. This causes problems when nodes that have 'corrected' the LRC but have not been persisted get evicted from memory: that adjustment must be undone, because from a persisted-truth point of view the LRC should never have changed. This is where I think the problem lies in your test case: we are applying updates (inserts), causing the LRC to be decremented, somehow evicting the node without 'un-correcting' the LRC, then re-reading the node and repeating the process until we get to 0. The update in your last post indicates that it seems worse when the trees don't fit in memory. That fits the situation I described, since it means more evictions, and it points to the possibility that there is some node eviction path that is not properly 'un-correcting' the LRC.
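
To make that concrete, a hypothetical trace for one repeatedly-updated row in an 8M row table, following only the logic described above (the numbers are purely illustrative):

    load completes                                   LRC = 8,000,000
    UPDATE row X -> insert message at top of tree    LRC = 8,000,001   (guess: +1)
    query applies the message at the leaf            LRC = 8,000,000   (correct: -1; leaf not dirtied, message stays in tree)
    leaf node evicted, 'un-correct' missed           LRC = 8,000,000   (should have gone back to 8,000,001)
    node re-read, same message applied again         LRC = 7,999,999   (another -1 for the same pending message)
    ...                                              repeated under an eviction-heavy workload, the LRC walks down toward 0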


I am going to re-open the FT and DB issues and see if I can figure out where the LRC is not being un-corrected on eviction and will post back with any discovery.

George Lorch

Dec 1, 2016, 13:38:45
to Percona Discussion
Also, unfortunately there is no easy way to edit/change the LRC via SQL or any other means. We have had a low-priority task to expose all of the TokuDB table info (row counts, cardinality stats, etc.) as a standard system table that is updatable via SQL, which would allow a quick workaround along the lines of "UPDATE tokudb_statistics SET row_count=10000 WHERE database='foo' AND table='bar';"

Your discovery/reading and use of the recount_rows functionality is right on, and yes, it can be quite slow since it needs to fully traverse the table; it really isn't much different from a SELECT COUNT(*). The belief was that we had all of this fixed and that there was no need for any periodic adjustment other than when migrating a table forward to a version of FT that maintains the LRC.

George Lorch

Dec 1, 2016, 15:01:20
to Percona Discussion
Mark, can you share the exact secondary index that you create?

George Lorch

Dec 1, 2016, 17:25:08
to Percona Discussion
I believe I have reproduced this using the standard sysbench schema with an extremely eviction-heavy workload mixing the standard OLTP selects with index and non-index updates. It will need to run for a bit to confirm the crossing to 0, but so far I see a continuous and aggressive decrease in the LRC with no indication of it slowing or stopping.

MarkCallaghan

Dec 1, 2016, 20:49:50
to Percona Discussion
DDL for sysbench comes from https://github.com/akopytov/sysbench/blob/1.0/sysbench/tests/db/common.lua

One other question. Sometimes I see this and I wonder whether the cost of analyze done in the middle of an SQL statement will make it run longer than I want it to. I don't think this is a good thing to do to a user who wants less variance, and if this update takes 10 seconds longer because of analyze, what happens when it is part of a transaction holding locks that block others who now must wait for 10 seconds?

1076    root    localhost:58352 test    Query   1       analyze table standard test.sbtest1.k_1 1 of 2 27% rows 10% time, scanning forward      UPDATE sbtest1 SET c='97165954714-08811264689-56153433123-57124467619-11642119545-56069335926-839587        0       1

From the slow log I think this is a problem: tokudb_analyze_time=10 is set in my.cnf and I see slow update statements that take a bit more than 10 seconds, so I will blame analyze.

# Time: 2016-12-02T01:47:38.681632Z
# User@Host: root[root] @ localhost [127.0.0.1]  Id:  1080
# Schema: test  Last_errno: 0  Killed: 0
# Query_time: 10.003225  Lock_time: 0.000036  Rows_sent: 0  Rows_examined: 1  Rows_affected: 1
# Bytes_sent: 52
SET timestamp=1480643258;
UPDATE sbtest1 SET c='69810220561-50025519356-75199296916-11629574915-22717579580-20154582166-71154826779-43934982062-09004798228-92442560628' WHERE id=40134897;

relevant my.cnf settings in this case are:
tokudb_analyze_in_background = OFF
tokudb_analyze_time=10
#tokudb_analyze_mode=TOKUDB_ANALYZE_STANDARD
tokudb_analyze_mode=TOKUDB_ANALYZE_RECOUNT_ROWS
#tokudb_auto_analyze=0
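
(For comparison, the backgrounded variant of the same settings, which should keep the recount out of the statement path, would be:)
tokudb_analyze_in_background = ON
tokudb_analyze_time = 10
tokudb_analyze_mode = TOKUDB_ANALYZE_RECOUNT_ROWS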

MarkCallaghan

Dec 1, 2016, 20:53:33
to Percona Discussion



With MyRocks we compute statistics per SST file and AFAIK the global metrics are adjusted each time an SST file is created or deleted. The per-SST file stats are written into the SST file. I have not looked at the behavior closely but it seems to work well so far. AFAIK compaction in RocksDB (delete some SST files, create new SST files) is similar to moving messages down the fractal tree.

George Lorch

Dec 2, 2016, 15:57:55
to Percona Discussion
Hello again Mark,


On Thursday, December 1, 2016 at 6:49:50 PM UTC-7, MarkCallaghan wrote:
> DDL for sysbench comes from https://github.com/akopytov/sysbench/blob/1.0/sysbench/tests/db/common.lua

Ahh, thanks for that. I thought you had implied that you added more indices to the table than the standard sysbench schema. It's all good anyway; I have it reproduced and am testing out a fix as I type this.

> One other question. Sometimes I see this and I wonder whether the cost of analyze done in the middle of an SQL statement will make it run longer than I want it to. I don't think this is a good thing to do to a user who wants less variance, and if this update takes 10 seconds longer because of analyze, what happens when it is part of a transaction holding locks that block others who now must wait for 10 seconds?

Completely agreed, which is why the default is tokudb_analyze_in_background=true, and why we implemented the background job manager and the i_s.tokudb_background_job_manager table. With this we can monitor what is going on, and we may add more to it later, such as optimize or other maintenance functions. Running in the background is beneficial for a few reasons, mainly so one can give it time to run, throttle it so it doesn't cause starvation, and run it without holding as many logical locks as if it were running in the foreground. Similar to something you said before, in many workloads one only needs to run analyze once and can stick with those results nearly forever if the key distribution stays relatively the same over time, so there is no need to even bother with auto analysis. I would actually have liked to make the analyze smarter and work with the FT library to behave a little more like InnoDB and dive down the tree only for nodes that are in memory, but that would have been quite a bit more work, and we had a user who needed the auto analysis implemented and released ASAP. We still have this on our roadmap, but I don't know when we will get to it as there are many other tasks ahead of it.
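
(So the queued and running jobs can be watched with something like the query below; I am writing the information_schema table name from memory here, so check information_schema for the exact name in your build:)

    select * from information_schema.tokudb_background_job_status;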

George Lorch

Dec 2, 2016, 16:04:52
to Percona Discussion
I was wondering how RocksDB maintains a reasonably accurate row count as it also has the whole deferred/blind operational issue as well. What I really wanted to do rather than this fuzzy counting was to embed the LRC into the FT nodes so that each node maintained a cumulative logical count of what it had within its subtree, but that would involve changing the on disk format of the nodes and an upgrade path that we really didn't want to have to deal with.

ross....@avantalytics.com

Dec 21, 2016, 08:14:29
to Percona Discussion
I'm having the same problem in PS 5.7.16-10 on a table with a significant number of updates. The interesting thing is that the temporary fix suggested above produces the following:


mysql> set session tokudb_analyze_mode=TOKUDB_ANALYZE_RECOUNT_ROWS;
Query OK, 0 rows affected (0.00 sec)

mysql> analyze table db.table;
+----------+---------+----------+------------------+
| Table    | Op      | Msg_type | Msg_text         |
+----------+---------+----------+------------------+
| db.table | analyze | status   | Operation failed |
+----------+---------+----------+------------------+
1 row in set (0.00 sec)

mysql>

So it seems I can't even fix the problem without a table dump/reload. Any other suggestions?

ross....@avantalytics.com

Dec 21, 2016, 08:38:12
to Percona Discussion
Hi George,

I'm experiencing this same issue in 5.7.16-10 - were you able to come up with a fix for this? I cannot even fix the row count issue with a table analyze.

George Lorch

Jan 2, 2017, 15:09:32
to Percona Discussion, ross....@avantalytics.com

Hello Ross,
I apologize for the delay in responding as I have been out on holiday for a few weeks.

Yes, we have a fix in review and testing that will address the 'row count descends to 0 under certain workloads' issue, tracked here: https://tokutek.atlassian.net/browse/FT-732

Ross Carver

Jan 3, 2017, 03:53:44
to George Lorch, Percona Discussion
Thanks George - I'm glad it's been re-opened, hope to see the fix soon!

Ross
--
Ross Carver
ross....@avantalytics.com
Avantalytics Corp
Ph. 866-302-6747 x314
Ph. 571-406-5395 (direct)
Ph. 703-254-9294 (mobile)

MarkCallaghan

Feb 22, 2017, 20:29:13
to Percona Discussion, ross....@avantalytics.com
The Percona 5.7.17 build worked great for me with sysbench. Up next is the insert benchmark.

MarkCallaghan

Mar 6, 2017, 13:28:01
to Percona Discussion, ross....@avantalytics.com
5.7.17-11 doesn't fix the problem for the insert benchmark. Once I start running queries concurrently with inserts, the query plans eventually switch from "range" to "index". Restarting mysqld is a workaround.

George Lorch

Mar 6, 2017, 13:55:48
to Percona Discussion, ross....@avantalytics.com
Thanks for the update Mark. This was the one that I could not reproduce, and visual inspection of the TokuDB handler code shows nothing that would indicate how this transient state can be reached. The fact that a restart clears the issue tells me that something in the table's transient share instance is getting into an incorrect/bad state, which then percolates out to the optimizer in the form of a sub-optimal query plan.

I will pass this on to our QA team and see if we can reproduce and possibly reduce to a repeatable case where I can dig further.

If you happen to hit it again, can you get me the result of show indexes and show table status on the problem table when in the 'bad' state and then again after you restart when in the 'good' state? It would be greatly appreciated.
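
(Concretely, something like this on the affected table, once while it is in the bad state and once right after the restart; test.sbtest1 is just a stand-in here for the insert benchmark table name:)

    show indexes from test.sbtest1;
    show table status from test like 'sbtest1';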

George Lorch

Mar 8, 2017, 15:19:30
to Percona Discussion, ross....@avantalytics.com
Thanks to some sleuthing help from Rick Pizzi, I think we understand what might be confusing the optimizer now. I'm working on a fix and test and we'll see if it is the same thing.

Riccardo Pizzi

Mar 8, 2017, 15:24:27
to percona-d...@googlegroups.com
Some details on my case here https://bugs.launchpad.net/percona-server/+bug/1671152

Rick

Riccardo Pizzi
pi...@leopardus.com
http://www.pinballowners.com/webmaster

Rick Pizzi

Mar 10, 2017, 06:57:32
to Percona Discussion, ross....@avantalytics.com
Unfortunately, the fix doesn't seem to help. Estimates are better but still incorrect enough to cause full table scans. Please see bug report above.