Hi Tobias,
Thanks for the interesting post, complete with interesting links!
One of the goals of the PMem programming model was to make PMem follow the memory semantics of the platform once it is mapped into the application's address space. We didn't want to try to force an unnatural memory model into a platform with a different model. For that reason, the answers your questions could be different for different platforms. I'm not an ARM expert, so I can only provide an answer for x86 -- hope that's useful.
On x86, PMem is supported by creating a "persistence domain" which extends a small amount beyond the memory device -- it includes the write pending queue in the memory controller (or more than that if the platform is built with the CPU caches in the persistence domain, but that case obviously avoids the problem we're discussing, so I'll stick to platforms where the CPU caches are not in the persistence domain). I bring this up because you mention JEDEC and how data is sent to the DIMM, but that's not what matters for persistence -- it is the point at which the store is accepted by the memory subsystem. Again, this may very well be specific to x86 systems -- I'm not speaking to how an ARM system works.
When people talk about an atomic operation, they are typically asking the "all or nothing" question. Although a cache line flush on x86 is very likely to write all 64 bytes atomically to the memory subsystem on modern server CPUs, it is not architecturally guaranteed -- an implementation is free to break up the store into multiple chunks along the way. A recent instruction, MOVDIR64B, does provide an architecturally guaranteed way to send 64-bytes to the memory subsystem without it getting broken up. So one way to ensure the algorithm you mention works correctly is to use MOVDIR64B. However, I think the algorithm you link to will work fine as long as the ordering of writes during the shift operation is done correctly. The paper describes how the ordering of the writes is done carefully so that if a line is evicted at any point, the worst thing that can happen is a duplicate entry is found after reboot, and the algorithm is designed so that those duplicate entries are handled correctly. The issue really isn't whether the flush operation is atomic, in my opinion, it is whether evictions, which can happen any time due to other system activity, can cause an inconsistent state. It seems to me the clever part of the paper is how they avoid the inconsistent state by defining what happens if an eviction creates the duplicate.
Now to answer your question directly, can part of the cache line reach PMem and the rest is dropped due to a power failure? Although it is unlikely, there is not an architectural guarantee that all 64-bytes travel to the persistence domain as a unit unless you use MOVDIR64B. But even if x86 were to make that architectural guarantee in the future, since nothing prevents evictions at any point, SW would still have to deal with the case where it is in the middle of storing the 64-bytes of data and some of it becomes persistent before SW can execute the rest of the stores.
-andy