Hey,
While playing around a bit, I've started to wonder how Intel implements non-temporal loads with PMem. There is a lot out there about non-temporal stores that bypass the cache hierarchy, but surprisingly little information on loads.
The Intel Software Developer's Manual states that the non-temporal hint on loads only takes effect for Write-Combining (WC) memory. I'm aware that Optane has a write-combining buffer, but I don't know if Optane is classified as WC memory. The manual uses graphics memory as its example.
So I'm wondering if using a non-temporal load (e.g., via _mm512_stream_load_si512) actually bypasses the cache when reading data or if the non-temporal hint is ignored and data is pulled through the hierarchy. I'd appreciate any pointers to documentation that clarifies this behavior or any information on this in general. Or does anyone know a way to verify cache residency with a short C/C++ program?
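To make that last question concrete, the kind of test I have in mind is roughly the following. It is only a sketch under several assumptions: x86-64 with GCC or Clang, the 128-bit `_mm_stream_load_si128` (MOVNTDQA) swapped in for `_mm512_stream_load_si512` so it runs without AVX-512, and helper names (`timed_probe`, `probe_once`, `median_probe`) that I made up. The idea is to flush a line, optionally load it with a regular or non-temporal load, and then time a second access with `__rdtscp`. If the second access is about as fast after the non-temporal load as after the regular one, the line was presumably cached and the hint ignored.

```cpp
#include <immintrin.h>   // _mm_stream_load_si128, _mm_load_si128, _mm_clflush, fences
#include <x86intrin.h>   // __rdtscp (GCC/Clang on x86-64)
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical names: which access (if any) precedes the timed probe.
enum class Mode { Flushed, Regular, NonTemporal };

volatile int g_sink;  // keeps the compiler from dropping the loads

// TSC ticks needed to read the first byte of `line` right now.
uint64_t timed_probe(const volatile unsigned char* line) {
    unsigned aux;
    _mm_lfence();
    uint64_t t0 = __rdtscp(&aux);
    _mm_lfence();
    g_sink = *line;                 // the access being timed
    uint64_t t1 = __rdtscp(&aux);   // waits for the preceding load to complete
    _mm_lfence();
    return t1 - t0;
}

// Flush the line, optionally touch it again, then time a probe of it.
__attribute__((target("sse4.1")))
uint64_t probe_once(void* line, Mode mode) {
    _mm_clflush(line);              // evict the line from all cache levels
    _mm_mfence();
    if (mode == Mode::Regular) {
        g_sink = _mm_cvtsi128_si32(_mm_load_si128(static_cast<__m128i*>(line)));
    } else if (mode == Mode::NonTemporal) {
        g_sink = _mm_cvtsi128_si32(
            _mm_stream_load_si128(static_cast<__m128i*>(line)));
    }
    _mm_mfence();
    return timed_probe(static_cast<unsigned char*>(line));
}

// Median over `reps` runs to smooth out interrupts and other noise.
uint64_t median_probe(void* line, Mode mode, int reps) {
    assert(reinterpret_cast<uintptr_t>(line) % 64 == 0);  // cache-line aligned
    assert(reps > 0);
    std::vector<uint64_t> ticks(reps);
    for (auto& t : ticks) t = probe_once(line, mode);
    std::nth_element(ticks.begin(), ticks.begin() + reps / 2, ticks.end());
    return ticks[reps / 2];
}
```

I would call `median_probe` with all three modes on a 64-byte-aligned address, first in DRAM and then inside a DAX-mapped PMem file, and compare the numbers: `Flushed` should give the miss latency and `Regular` the hit latency, with `NonTemporal` landing near one of them. I'm not sure how reliable this is, though, since a non-temporal load might leave the data in a fill buffer rather than a cache, which could make the probe look fast without the line actually being cached.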
Best,
Lawrence