Distributed transaction timestamp

Scooby Doo

Nov 1, 2019, 5:48:33 AM

to mongodb-dev

I have a question about MongoDB's distributed transaction design.

[Screenshot attachment: 截屏2019-11-0117.34.17.png]

The situation is that:

1. Transaction T2 gets a timestamp S2 from the cluster. Due to clock skew, S2 is larger than S1.

2. Before T2 reaches the shard, transaction T1 commits and gets the timestamp S1.

3. Before T2 commits, T2 reads the conflicting record.

4. According to the timestamp semantics, T2 should successfully read T1's data. But according to the transaction semantics, T1 is concurrent with T2, so T2 could not read T1's data.
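The steps above can be reproduced with a toy model. This is only a sketch: `ToyShard`, `begin_commit`, `finish_commit`, and `naive_snapshot_read` are hypothetical names invented for illustration and do not correspond to MongoDB's code. It assumes a reader that simply ignores writes whose commit is not yet visible.

```python
from dataclasses import dataclass, field


@dataclass
class ToyShard:
    """Hypothetical single-shard store; not MongoDB's implementation."""
    # key -> list of (commit_ts, value) for commits that are already visible
    committed: dict = field(default_factory=dict)
    # key -> (reserved_commit_ts, value) for commits still in progress
    pending: dict = field(default_factory=dict)

    def begin_commit(self, key, value, commit_ts):
        # T1 reserves its commit timestamp, but the write is not visible yet.
        self.pending[key] = (commit_ts, value)

    def finish_commit(self, key):
        # The write becomes visible at the timestamp reserved earlier.
        commit_ts, value = self.pending.pop(key)
        self.committed.setdefault(key, []).append((commit_ts, value))

    def naive_snapshot_read(self, key, read_ts):
        # Assumption: the reader ignores pending writes instead of waiting.
        visible = [(ts, v) for ts, v in self.committed.get(key, []) if ts <= read_ts]
        return max(visible, key=lambda tv: tv[0])[1] if visible else None


S1, S2 = 10, 20  # S1 < S2, as in the scenario
shard = ToyShard(committed={"x": [(1, "old")]})
shard.begin_commit("x", "new", S1)           # T1 is committing at S1
first = shard.naive_snapshot_read("x", S2)   # T2 reads before T1 is visible
shard.finish_commit("x")                     # T1 becomes visible
second = shard.naive_snapshot_read("x", S2)  # same snapshot timestamp, later
print(first, second)
```

In this model the two reads at the same timestamp S2 return different values, which is the inconsistency described in step 4.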

As I understand it, the storage engine could solve this problem by blocking a reading transaction that encounters uncommitted data whose commit timestamp is not yet determined. For example:

[Screenshot attachment: 截屏2019-11-0117.45.14.png]

In this case, transaction T2 will be blocked until T1 commits, because T1 is in the committing state.
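A minimal sketch of this blocking idea, using a condition variable. `BlockingShard` and all of its methods are hypothetical and are not taken from MongoDB or its storage engine. The sketch assumes a reader must wait whenever the key it reads has an in-progress commit with a reserved timestamp at or below the read timestamp.

```python
import threading


class BlockingShard:
    """Hypothetical store where readers wait on in-progress commits."""

    def __init__(self, initial=None):
        self._cond = threading.Condition()
        self._committed = dict(initial or {})  # key -> [(commit_ts, value)]
        self._pending = {}                     # key -> (reserved_commit_ts, value)

    def begin_commit(self, key, value, commit_ts):
        with self._cond:
            self._pending[key] = (commit_ts, value)

    def finish_commit(self, key):
        with self._cond:
            commit_ts, value = self._pending.pop(key)
            self._committed.setdefault(key, []).append((commit_ts, value))
            self._cond.notify_all()  # wake readers blocked on this commit

    def _must_wait(self, key, read_ts):
        entry = self._pending.get(key)
        return entry is not None and entry[0] <= read_ts

    def snapshot_read(self, key, read_ts, timeout=5.0):
        with self._cond:
            # Block while a commit at or below read_ts is still in progress.
            if not self._cond.wait_for(lambda: not self._must_wait(key, read_ts), timeout):
                raise TimeoutError("commit did not become visible in time")
            visible = [(ts, v) for ts, v in self._committed.get(key, []) if ts <= read_ts]
            return max(visible, key=lambda tv: tv[0])[1] if visible else None


shard = BlockingShard(initial={"x": [(1, "old")]})
shard.begin_commit("x", "new", commit_ts=10)   # T1 is committing at S1 = 10
timer = threading.Timer(0.05, shard.finish_commit, args=("x",))
timer.start()                                  # T1 finishes shortly afterwards
result = shard.snapshot_read("x", read_ts=20)  # T2 at S2 = 20 blocks until then
timer.join()
print(result)
```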

I've read MongoDB's source code, but I could not find a corresponding solution. Could you give me a hint? Thanks!

Scooby Doo

Nov 7, 2019, 1:12:58 AM

to mongodb-dev

Oh, nobody wants to answer this question. Is it too rude or private?

Esha Maharishi

Nov 7, 2019, 5:10:18 PM

to mongodb-dev

Hi Scooby,

If I understand correctly, you mean that T2 is a snapshot read whose atClusterTime was chosen by the router to be cluster time S2, and the read hits a shard that has reserved time S1 < S2 to commit some transaction, but the commit has not been made visible yet.

The question is whether the snapshot read blocks until the write at S1 is visible and, if so, how.

The answer is yes, the read does block. Because the read's cluster time is greater than the last applied time on the shard, the read will do a noop write on the shard to advance the shard's lastAppliedOpTime to S2, then call waitUntilOpTimeForRead with S2.
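The "advance with a noop write" step might be pictured roughly as follows. This is a toy model under stated assumptions, not MongoDB's implementation: `ToyShardClock`, `write`, and `prepare_snapshot_read` are hypothetical names, and the sketch only models advancing the last applied time. It does not model the subsequent wait for earlier writes to become visible.

```python
class ToyShardClock:
    """Hypothetical model of a shard's last applied time; not MongoDB code."""

    def __init__(self, last_applied):
        self.last_applied = last_applied
        self.oplog = []  # list of (timestamp, description)

    def write(self, ts, description):
        # Oplog timestamps only move forward in this toy model.
        if ts <= self.last_applied:
            raise ValueError("timestamp must be greater than last_applied")
        self.oplog.append((ts, description))
        self.last_applied = ts

    def prepare_snapshot_read(self, at_cluster_time):
        # If the shard is behind the read's cluster time, advance it with a noop.
        if at_cluster_time > self.last_applied:
            self.write(at_cluster_time, "noop")
        # After this point no new write can take a timestamp <= at_cluster_time.
        return self.last_applied >= at_cluster_time


shard = ToyShardClock(last_applied=10)   # shard has applied up to S1 = 10
ready = shard.prepare_snapshot_read(20)  # snapshot read at S2 = 20
print(ready, shard.oplog)
```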

Let me know if this answers your question.

Best,

Esha

Scooby Doo

Nov 7, 2019, 9:16:58 PM

to mongodb-dev

Thank you for the explanation.

I understand the noop write and waitUntilOpTimeForRead mechanism. But I don't think it could solve this problem.

The scenario is: T2 has already made a noop write and waited until lastAppliedTime reaches S2, but T1 is still not visible. So T2 still cannot read T1's data until T1 commits, which violates snapshot timestamp semantics (a snapshot read can read any transaction whose commit timestamp is smaller than the read timestamp).

On Friday, November 8, 2019 at 6:10:18 AM UTC+8, Esha Maharishi wrote:

Esha Maharishi

Nov 8, 2019, 9:46:44 AM

to mongodb-dev

No problem.

To answer your follow-up question:

After the noop write, waitUntilOpTimeForRead should wait for all "holes" to be filled as part of this call to waitForAllEarlierOplogWritesToBeVisible.
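One way to picture oplog "holes" is as reserved timestamp slots that have not been filled yet. The sketch below is a hypothetical toy (`ToyOplog`, `reserve`, `fill`, and `all_earlier_writes_visible` are invented names) and is not how waitForAllEarlierOplogWritesToBeVisible is implemented. It only shows why a read at S2 is not ready while a slot at S1 < S2 is still empty.

```python
class ToyOplog:
    """Hypothetical oplog with reserved slots; not MongoDB's implementation."""

    def __init__(self):
        self._slots = {}  # timestamp -> entry, or None while the slot is a hole

    def reserve(self, ts):
        # A writer has been assigned ts but its entry is not visible yet.
        self._slots[ts] = None

    def fill(self, ts, entry):
        # The writer's entry becomes visible, closing the hole at ts.
        self._slots[ts] = entry

    def all_earlier_writes_visible(self, ts):
        # True only if no slot at or below ts is still a hole.
        return all(entry is not None for t, entry in self._slots.items() if t <= ts)


oplog = ToyOplog()
oplog.reserve(10)                              # T1 reserved S1 = 10, not visible yet
oplog.reserve(20)
oplog.fill(20, "noop")                         # noop write at S2 = 20
before = oplog.all_earlier_writes_visible(20)  # hole at 10 is still open
oplog.fill(10, "T1 commit")                    # T1's commit becomes visible
after = oplog.all_earlier_writes_visible(20)   # no holes at or below 20
print(before, after)
```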

Scooby Doo

Nov 9, 2019, 2:44:26 AM

to mongodb-dev

Interesting. That definitely answers my question.

waitForAllEarlierOplogWritesToBeVisible is a concise solution to the commit visibility problem. However, it introduces redundant waiting when a transaction tries to read a record that is unrelated to the "hole" in the oplog. Partitioning the key space at a finer granularity might reduce this irrelevant waiting. Of course, there's a gap between theory and practice.
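The partitioning idea could be sketched like this. It is purely illustrative: `PartitionedHoles` is a hypothetical structure, integer keys mapped with `key % num_partitions` are an assumption made for simplicity, and nothing here reflects how MongoDB tracks oplog visibility. With one partition every read waits on every hole, while with more partitions a read waits only on holes in its own partition.

```python
from collections import defaultdict


class PartitionedHoles:
    """Hypothetical per-partition hole tracker; not part of MongoDB."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._holes = defaultdict(set)  # partition -> reserved, not-yet-visible timestamps

    def _partition(self, key):
        # Assumption: integer keys, mapped to partitions by modulo.
        return key % self.num_partitions

    def reserve(self, key, ts):
        self._holes[self._partition(key)].add(ts)

    def make_visible(self, key, ts):
        self._holes[self._partition(key)].discard(ts)

    def read_must_wait(self, key, read_ts):
        # A read waits only for holes in its own partition at or below read_ts.
        return any(ts <= read_ts for ts in self._holes[self._partition(key)])


coarse = PartitionedHoles(num_partitions=1)  # one global oplog-like view
fine = PartitionedHoles(num_partitions=2)    # finer-grained tracking
for tracker in (coarse, fine):
    tracker.reserve(key=1, ts=10)            # T1 is committing key 1 at S1 = 10

coarse_waits = coarse.read_must_wait(key=2, read_ts=20)    # unrelated key still waits
fine_waits = fine.read_must_wait(key=2, read_ts=20)        # unrelated key does not wait
fine_conflict = fine.read_must_wait(key=1, read_ts=20)     # conflicting key waits
print(coarse_waits, fine_waits, fine_conflict)
```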

Anyway, thank you for the explanation. I will keep learning about MongoDB's replication and transaction mechanisms.

On Friday, November 8, 2019 at 10:46:44 PM UTC+8, Esha Maharishi wrote:
