libpmemblk/kernel btt code. how t avoid read/write same data blockers (ABA)?

Wu, Dennis

unread,

May 13, 2022, 11:50:01 PM5/13/22

to pm...@googlegroups.com

Following is the explain for the RTT https://docs.kernel.org/driver-api/nvdimm/btt.html?highlight=btt: Consider a case where we have two threads, one doing reads and the other, writes. We can hit a condition where the writer thread grabs a free block to do a new IO, but the (slow) reader thread is still reading from it. In other words, the reader consulted a map entry, and started reading the corresponding block. A writer started writing to the same external LBA, and finished the write updating the map for that external LBA to point to its new postmap ABA. At this point the internal, postmap block that the reader is (still) reading has been inserted into the list of free blocks. If another write comes in for the same LBA, it can grab this free block, and start writing to it, causing the reader to read incorrect data. To prevent this, we introduce the RTT.The RTT is a simple, per arena table with ‘nfree’ entries. Every reader inserts into rtt[lane_number], the postmap ABA it is reading, and clears it after the read is complete. Every writer thread, after grabbing a free block, checks the RTT for its presence. If the postmap free block is in the RTT, it waits till the reader clears the RTT entry, and only then starts writing to it.

But when I review the libpmemblk and kernel btt code and the RTT logic, I think the logic can’t avoid the read/write same data blockers. We don't have any lock for the RTT that means: Every writer thread, after grabbing a free block, checks the RTT for its presence. If the RTT is not written by the read thread, it will start to write the ABA, now the read thread write the RTT and start to read data, that will read the dirty data. Please help explain?

In my mind, we must have some lock to assure the execution order of the RTT write in the read thread and RTT read in the write thread. In the logic, we considering to use the map_locks to assure the different threads write the same LBA, while we don’t have any lock to assure the read/write same ABA.

I would like to hear more from you and appreciate you to make me clear on the logic.

BR,
Dennis Wu

Andy Rudoff

unread,

May 14, 2022, 11:10:37 AM5/14/22

to pmem

Hi Dennis,

First, some background. Vishal Verma and I designed BTT on a white board and I wrote it up into a spec (which was later added to the UEFI spec since the format is interpreted by the pre-boot environment when booting from a BTT device). Vishal and I decided to create separate implementations from the same spec to compare our results and check the correctness of our spec. I did the libpmemblk code and Vishal did the Linux kernel code. Therefor, my answers are for the libpmemblk code, but for RTT it looks like the implementations are quite similar.

Of course there could always be bugs remaining, but I can say I tested concurrent reads & writes extensively during development, showing that data corruption that happened without the RTT no longer happened when RTT code was added. I don't understand your comment about locking. The RTT contains aligned, 32-bit so loads and stores are atomic, so we just need to ensure ordering, which is done by adding memory barriers to the code. In your example:

>Every writer thread, after grabbing a free block, checks the RTT for its presence.

>If the RTT is not written by the read thread, it will start to write the ABA,

>now the read thread write the RTT and start to read data,

>that will read the dirty data. Please help explain?

The order of operations is this:

1. a write comes in, so a free block is allocated from the free list. By definition, no map entry points to this free block, so although there may be old reads still in progress from before the block was freed, no new reads can start reading this block.

2. the write checks the RTT and if it finds any threads still reading the free block, it waits for the reads to finish.

3. as old reads finish on that block, they clear the corresponding RTT value, allowing the wait in the previous step to complete.

4. when no more old reads are remaining, the write continues and starts updating the block.

5. your flow above says "now the read thread will write the RTT and start to read data" but that's not possible. no map entry points to the block being written by the write thread, so no new reads can start. and by the time we get to step 4, all old reads are finished.

Were you able to see any cases where RTT worked incorrectly in the current (unmodified by you) libpmemblk or Linux BTT code? If so, we definitely want to hear the details and fix the bug.

Thanks,

-andy

Wu, Dennis

unread,

May 15, 2022, 3:29:39 AM5/15/22

to Andy Rudoff, pmem

Hi Andy,

I think this is a corner case, it is hard to validate, but since we don’t know the detail execution order for every steps and I define the following steps and please take a look. We can see our step 12 and 10 are operating the same ABA.

BR,
Dennis Wu

--
You received this message because you are subscribed to the Google Groups "pmem" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pmem+uns...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pmem/a691ef37-9744-4122-8eb5-e78f0093a1c4n%40googlegroups.com.

Andy Rudoff

unread,

May 15, 2022, 9:56:14 PM5/15/22

to pmem

Hi Dennis,

Thanks for the picture, that is very helpful. Let me spend some time going through it detail.

-andy

Andy Rudoff

unread,

May 15, 2022, 10:19:13 PM5/15/22

to pmem

Hi Dennis,

After walking through your picture, I believe this case is handled correctly by the current implementation. This comment in btt.c explains exactly the case in your picture and what we do about it:

* In case this thread was preempted between reading entry and
* storing it in the rtt, check to see if the map changed. If
* it changed, the block about to be read is at least free now
* (in the flog, but that's okay since the data will still be
* undisturbed) and potentially allocated and being used for
* another write (data disturbed, so not okay to continue).

Your picture is based on the idea that there's a pause between reading the map (step 4) and setting the RTT (step 11) so that steps 5 though 10 can happen during the pause. However, before the code executes what you labeled step 12, it does the check described in the comment above. That would cause it to re-read the map, set RTT, and begin reading from the new location of the block set in step 7. Therefor, the case you describe is prevented by the current implementation of libpmemblk, and I checked the Linux implementation and it has similar logic for the same case.

See here: https://github.com/pmem/pmdk/blob/master/src/libpmemblk/btt.c#L1553

Please let me know if that makes sense.

Thanks,

-andy

Wu, Dennis

unread,

May 16, 2022, 1:44:35 AM5/16/22

to Andy Rudoff, pmem

Yes, I got it. You are right, read the btt_map again to confirm if the map is updated, if updated, read another location. If not, at least the rtt set correctly.

Each postmap will be divided into the lanes, in the picture btt_write(lba1), btt_write(lba2) can not be run in parallel (in the same lane).

The logic is right, but for read, it need read the btt_map twice.

Thank you very much for the explanation.

BR,
Dennis Wu

To view this discussion on the web visit https://groups.google.com/d/msgid/pmem/a412d1e9-382b-4cfd-8a08-b5ed49f06d94n%40googlegroups.com.

Reply all

Reply to author

Forward