For what purpose that commit timestamps must not be set in the past of any read timestamp that has been used?

孔德雨

unread,

Mar 18, 2020, 8:55:08 AM3/18/20

to wiredtiger-users

From wt's manual, I found:

Quoted

"Commit timestamps cannot be set in the past of any read timestamp that has been used. This is enforced by assertions in diagnostic builds, if applications violate this rule, data consistency can be violated."

I believe the " assertions in diagnostic builds" is referring to "__txn_assert_after_reads"

It is hard for me to construct some situation when the violation of this condition leads to data inconsistency.

Would someone give some example, such as two racing threads, that may detail this document?

Best Regards

Deyu Kong.

孔德雨

unread,

Mar 20, 2020, 4:38:00 AM3/20/20

to wiredtiger-users

I have found a clue to let this be a necessary condition, But I can not tell if this is a necessary and sufficient condition.

This condition makes sense in distributed-txn. I draw a picture to describe:

TxnA ----(W)----> MongoS -------> (node1) ----------[A start] ---------------[B start] -------------[A prepare] ------------------[B read]

| |

TxnB -----(R)----------| |--------> (node2) ----------[A start] --------------[A prepare] ----------[B start] ---------------[B read]

There are two txns, TxnA only writes, and TxnB only reads.

There are two nodes node1 and node2.

The wall-clock events in node1 and node2 happen in the sequence describe in the pic above.

In node1, B starts before A prepares. So from the aspect of SI, The prepare action is invisible to B.

In node2, B starts after A prepares, as wiredTiger releases snapshots after prepare finishes, so B will see A's prepare action.

The distributed txn must obeys SI, and in node1, A is invisible to B from SI's aspect. So in node2, there should be some mechanism to make A invisible to B.

That has to be the timestamp mechanism.

When A prepares, it negotiates the greatest HLC in all nodes as the prepare timestamp,so it definitely is larger than B's read timestamp.

If this is violated, A will be partially visible to B, which violates SI.

I wonder if this is sufficient , I wonder how mongo guarantees this.

More or less, I made progress on this puzzle.

在 2020年3月18日星期三 UTC+8下午8:55:08，孔德雨写道：

孔德雨

unread,

Mar 20, 2020, 5:59:45 AM3/20/20

to wiredtiger-users

update a typo:

When A prepares, it negotiates the greatest HLC in all nodes as the prepare timestamp,so it definitely is larger than B's read timestamp.

I mean:

When A prepares, it negotiates the greatest HLC(prepareTs) in all nodes as the Commit timestamp,so it definitely is larger than B's read timestamp.

在 2020年3月20日星期五 UTC+8下午4:38:00，孔德雨写道：

Michael Cahill

unread,

Mar 23, 2020, 10:50:11 PM3/23/20

to wiredtiger-users

Hi Deyu Kong,

The condition on commit timestamps not being in the past of any read timestamp exists so that reads at a timestamp are repeatable.

Imagine a read runs at timestamp 100. Then if a commit is permitted with timestamp 90, repeating the read would produce a different result.

Kind regards,

Michael.

Michael Cahill

unread,

Mar 23, 2020, 11:03:01 PM3/23/20

to wiredtiger-users

Hi Deyu Kong,

For the distributed read/write case, you are correct that both transactions needs to satisfy the SI conditions, so that B must either see all of A or none of it. That is achieved in two parts: (1) if B reads an update from A while it is prepared, B will retry the read (with backoff) waiting for A to be resolved, and (2) MongoDB's implementation of 2PC, which as you note will choose a commit timestamp greater than or equal to any of its prepare timestamps. Thus once A is resolved, it will have the same effective commit timestamp on both nodes and thus will either all be visible, or none of it will.

The ordering constraints on prepare vs commit vs durable timestamp for a given transaction are also critical in providing consistency at runtime and for crash recovery. The overall constraint on prepare timestamps being newer than any read timestamp (the equivalent for distributed transactions of the constraint you mentioned on commit timestamps) is necessary to guarantee repeatable reads for distributed transactions.

I hope that helps: let me know if more details would be useful.

Michael.

孔德雨

unread,

Mar 24, 2020, 11:28:37 PM3/24/20

to wiredtiger-users

Hi, Micheal,

Thanks for your detailed response.

I have some questions about this comment of yours. quoted below:

Imagine a read runs at timestamp 100. Then if a commit is permitted with timestamp 90, repeating the read would produce a different result.

Image the sequce below:

Op1: TxnA start

Op2: TxnB start

Op3: TxnB readAt key=a ts=100, Get Nothing

Op4: TxnA write key=a ts=90

Op5: TxnA commit

Op6: TxnB readAt key=a ts=100, Get What?

I do believe that in wiredTiger, in Op6, TxnB will not see TxnA's commit in Op5, because in wt, __wt_txn_visible will first check SI by __txn_visible_id, if update is invisible by SI check, timestamp check will not be considered.

So does your comment mean "repeat the read by opening new transactions, and those new transactions will read a result different than TxnB?

If the answer is "YES", I would ask one more question: "what standard does this behavior violate? since the ANSI-SQL-92-REPEATABLE-READ is describing something about two or more parallel-running transactions".

And, wt before version3.0, implemented the standard SI isolation level in this paper: "A Critique of ANSI SQL Isolation Levels ", However, as timestamp is introduced since version3.0, it seems more likely a private standard which is specially for mongodb and no literal standards or papers like ANSI-SQL-92 have announced what to do and what not to do, am I right?

在 2020年3月24日星期二 UTC+8上午10:50:11，Michael Cahill写道：

Michael Cahill

unread,

Mar 26, 2020, 2:21:18 AM3/26/20

to wiredtiger-users

Hi Deyu Kong,

The operations 4-5 (committing with an earlier timestamp) are not permitted -- that's what started this thread. The results of a query after that point are formally "undefined" -- i.e., we don't make any promises because the API contract has been violated.

You are correct that WiredTiger's current implementation first checks execution order using the internally assigned transaction IDs, and only then checks for timestamp visibility. The main point is when the API is used as expected, a read at timestamp 100 is well defined: that is, it will see the same results regardless of any other (valid) operations that are executed concurrently.

WT still provides "standard" SI. If used without timestamps, it provides strict SI so that a transaction T will see the results of all committed transactions before T started. When used with timestamps, it provides Generalized SI [1], where T's read timestamp determines the prefix of committed transactions that are visible to T.

Kind regards,

Michael.

[1] https://www.researchgate.net/publication/37423502_Generalized_Snapshot_Isolation_and_a_Prefix-Consistent_Implementation

孔德雨

unread,

Mar 28, 2020, 2:11:27 AM3/28/20

to wiredtiger-users

Hi, Michael Cahill

If I have understood this paper correctly, I believe "prefix-consistency generalized SI" can guarantee

1) different transactions starts at different time(parallelly not not does not matter) reads to different nodes in a replicaset with the same timestamp A get the same result.

2) Intuitively, the timestamp defines the "SI" in distributed database. "prefix-consistency generalized SI" eliminates snapshot holes to different transactions across different replset-nodes, leading to the stable or repeatable reads, even in non-parallel transactions.

This is relatively new to me, but not difficult to understand.

Thanks for your information.

Best Regards

Deyu

在 2020年3月26日星期四 UTC+8下午2:21:18，Michael Cahill写道：

Reply all

Reply to author

Forward