Thanks for the testing on this, Mark.
First off, let me cover the analysis angle. The auto analyze for cardinality does not impact the row count. The FT library tries to maintain a logical running row count, and that count can be quite fuzzy, as I will explain a bit later.
The automatic background analysis only recalculates cardinality statistics for the range of rows it is permitted, time-wise, to cover, and assumes that the distribution is roughly the same across the data range. Internally, the server expects us to report back to it the number of records per key and key part. I see where the confusion is on how to disable it: the docs do not mention it. I do remember specifying how to disable it at some point, but that must have gotten lost along the way. I am going to go back, figure out how this happened, and ensure the docs are updated. It is actually quite simple: in your my.cnf, set tokudb_auto_analyze=0.
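For reference, here is a minimal my.cnf snippet; the [mysqld] section is the usual place for server options, so adjust to fit your existing config:

```
[mysqld]
# Disable TokuDB automatic background cardinality analysis
tokudb_auto_analyze=0
```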
As you can see, once the reported row count hits 0, the cardinality information is useless; the optimizer seems to just ignore it and revert to table scans. Disabling the auto analysis will have no effect on the inaccurate row counts.
Now on to what I think is really your issue. Back in PS 5.6.32-78.0 and 5.7.14-7 we fixed
https://tokutek.atlassian.net/browse/FT-732 and
https://tokutek.atlassian.net/browse/DB-1006. That issue matches your description precisely, which is somewhat concerning: it indicates that we missed something and did not uncover it in testing.
The issue comes down to deciding how to adjust a running logical row count (LRC) when you have a data structure that supports blind and deferred operations (messages): for example, a delete that might not have a target record, an update that turns into an insert because the record is missing, or an insert that turns into an update because of a record collision. All of these really happen within the Fractal Tree, and their ultimate effect on the LRC cannot be pre-determined. Only when the messages are physically and permanently delivered to the leaves can the impact be known.
So what we have in the FT is a 'guess and correct' system. It guesses how to adjust the LRC based on the message type when the message is dropped into the top of the tree, then corrects the LRC when the message reaches the leaf entry (row) and is applied. One interesting and pertinent data point is that the FT treats all updates as inserts: an update at the top of the tree is converted to an insert message and as such immediately increments the LRC. When that insert message is applied to an existing record, it is recognized as an update and therefore 'corrects' the LRC by decrementing it.
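To make that concrete, here is a toy C++ sketch of the guess-and-correct idea. All of the names here (MsgType, FractalTree, on_message_injected, and so on) are my own illustration, not the actual FT library internals:

```cpp
#include <atomic>
#include <cstdint>

// The FT converts updates into insert messages, so only two message
// kinds matter for the LRC bookkeeping in this sketch.
enum class MsgType { INSERT, DELETE_ANY };

struct FractalTree {
    std::atomic<int64_t> lrc{0};  // logical running row count

    // Step 1: guess. Adjust the LRC when a message is injected at the
    // top of the tree, before we know whether a matching row exists.
    void on_message_injected(MsgType t) {
        if (t == MsgType::INSERT)     lrc += 1;  // may really be an update
        if (t == MsgType::DELETE_ANY) lrc -= 1;  // may have no target row
    }

    // Step 2: correct. When the message finally reaches the leaf entry
    // and is applied, undo the guess if it turned out to be wrong.
    void on_message_applied(MsgType t, bool row_existed) {
        if (t == MsgType::INSERT && row_existed)      lrc -= 1;  // it was an update
        if (t == MsgType::DELETE_ANY && !row_existed) lrc += 1;  // blind delete missed
    }
};
```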
This sounds simple enough until you add in searches/scans/queries. For a query to return the transactional truth, all of the pending messages above the target leaf entries must be pulled down and applied. This activity results in the LRC being adjusted by the 'correction' logic, meaning that inserts (which are actually updates) correct the LRC downward to counter the upward bump it got when the message was originally pushed into the tree. For these queries, this activity is NOT persistent: the messages remain in the tree and the leaf node is not marked dirty. Only when messages are pushed downward via the message flusher, a gorged node, node splits/merges, or other write operations does anything get persisted.

This causes problems when nodes that have 'corrected' the LRC, but have not been persisted, get evicted from memory. That adjustment must be undone, because from a persisted-truth point of view the LRC should never have changed. This is where I think the problem lies in your test case: we are applying updates (inserts), causing the LRC to decrement, somehow evicting the node without 'un-correcting' the LRC, then re-reading the node and repeating the process until we get to 0. The update in your last post indicates that it seems worse when the trees don't fit into memory. That fits the situation I described, as it means more evictions, and it points to the possibility that there is some node eviction path that is not properly 'un-correcting' the LRC.
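Continuing the illustrative sketch above (again, LeafNode, nonpersisted_lrc_delta, and on_evict are hypothetical names, not the real FT code paths), the 'un-correct' step that eviction must perform would look something like this:

```cpp
struct LeafNode {
    int64_t nonpersisted_lrc_delta = 0;  // LRC corrections applied for reads only
    bool dirty = false;
};

// A query pulls pending messages down and applies them so it can see the
// transactional truth, but nothing is persisted: the node stays clean.
void apply_messages_for_query(FractalTree& ft, LeafNode& node,
                              MsgType t, bool row_existed) {
    int64_t before = ft.lrc;
    ft.on_message_applied(t, row_existed);           // same correction a write would do
    node.nonpersisted_lrc_delta += ft.lrc - before;  // remember it; it is not durable
}

// On eviction of a clean node, the read-only corrections must be undone:
// from the persisted-truth point of view, the LRC never changed.
void on_evict(FractalTree& ft, LeafNode& node) {
    if (!node.dirty) {
        ft.lrc -= node.nonpersisted_lrc_delta;  // 'un-correct' the LRC
        node.nonpersisted_lrc_delta = 0;
    }
}
```

If some eviction path skips that roll-back, every read/evict/re-read cycle re-applies the same downward corrections, which would walk the LRC to 0 exactly the way your test shows.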
I am going to re-open the FT and DB issues, see if I can figure out where the LRC is not being un-corrected on eviction, and post back with any discovery.