How to use non-temporal (streaming) store instructions to store/load a self-defined struct?

Emeth Obando

Mar 2, 2021, 10:28:57 AM
to pmem
Hi all, 

I have just started using non-temporal store instructions to write data to memory (which could be DRAM or NVM).

I checked the Intel Intrinsics Guide for such store functions and found _mm_stream_si32, _mm_stream_si128, _mm256_stream_si256, etc. It seems that these functions can only be applied to integer types.

My question is: if I define my own struct, whose size may be 1 KB, 2 KB, and so on, how can I use non-temporal (streaming) stores to write such structs to memory (or, conversely, load them from memory)?

So far I can think of only one way: cast my struct to a chunk of integers and apply a non-temporal store/load to each of those integers one by one.

I think this method is somewhat inefficient. Is there a more efficient way to achieve my goal?

Also, if I want to store a large number of such structs, is it necessary to issue an sfence after every non-temporal store? I am not sure about that, and I wonder whether I could remove the sfence entirely or issue just one sfence after performing all the non-temporal stores.

Moreover, I found that the number of non-temporal (streaming) load functions is very limited. I only found one, _mm_stream_load_si128. Are there any other functions for loading?

Summary: 
  • How can I use non-temporal (streaming) store instructions to store/load a self-defined struct?
  • If the struct is 1 KB or 2 KB in size, how can I perform non-temporal (streaming) stores to write it to memory (or, conversely, load it from memory)?
  • Is it necessary to cast the struct to a chunk of integers and apply a non-temporal store/load to each integer one by one?
  • When storing a large number of such structs, is it necessary to issue an sfence after every non-temporal store? Can the sfence be removed, or can a single sfence be issued after all the non-temporal stores?
  • The only non-temporal (streaming) load function I found is _mm_stream_load_si128. Are there any others?
Many thanks for the help :) 

ppbb...@gmail.com

Mar 2, 2021, 10:49:15 AM
to pmem
Hi,
I would recommend against rolling your own low-level mechanisms like that. Instead, you can leverage nontemporal memcpy implemented in libpmem/libpmem2.
Here's an example that shows its usage:

This way you don't have to deal with all the details you've listed in your question.

Alternatively, you can find a lot of useful information (including answers to your questions) in the x86 Architectures Optimization Manual:

Look at sections 9.6 and 15.16.

Piotr

Emeth Obando

Mar 3, 2021, 9:19:45 AM
to pmem
Hi Piotr and all, 

Thank you for the information. 

I am wondering whether I should issue a `fence` instruction every time I perform such a non-temporal `memcpy` (or call non-temporal functions like _mm_stream_si32, _mm_stream_load_si128, etc.).

ppbb...@gmail.com

Mar 3, 2021, 9:48:04 AM
to pmem
The implementation found in libpmem/libpmem2 performs all the necessary operations to make sure everything works.
If you want to use the intrinsics manually - as I said, I recommend familiarizing yourself with the relevant parts of the manual linked above. But yes, you need to use appropriate fencing operations with non-temporal stores. This is explained in more detail in section 9.4.1.1.
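
For anyone who still wants to experiment with the intrinsics directly, here is a minimal C sketch. It is not taken from libpmem or from the manual. It assumes an x86-64 CPU with SSE2, a destination that is 16-byte aligned, and a struct whose size is a multiple of 16 bytes. The names `struct record`, `nt_copy`, and `store_records` are hypothetical and exist only for illustration. Each struct is copied in 16-byte chunks with `_mm_stream_si128`, and a single `_mm_sfence` is issued after the whole batch rather than after every store.

```c
#include <assert.h>     /* static_assert (C11) */
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 1 KB record, used only for illustration. */
struct record {
    uint64_t id;
    char payload[1016];
};

static_assert(sizeof(struct record) % 16 == 0,
              "record size must be a multiple of 16 bytes");

/*
 * Copy len bytes to dst using non-temporal 16-byte stores.
 * Assumptions: dst is 16-byte aligned and len is a multiple of 16.
 * src may be unaligned (it is read with an unaligned load).
 * No fence is issued here; the caller decides when to fence.
 */
static void nt_copy(void *dst, const void *src, size_t len)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < len / 16; i++) {
        __m128i v = _mm_loadu_si128(s + i);  /* ordinary load */
        _mm_stream_si128(d + i, v);          /* non-temporal store */
    }
}

/* Copy n records, then issue one sfence for the whole batch. */
static void store_records(struct record *dst, const struct record *src,
                          size_t n)
{
    for (size_t i = 0; i < n; i++)
        nt_copy(&dst[i], &src[i], sizeof(struct record));

    _mm_sfence();  /* orders all preceding non-temporal stores */
}
```

A call such as `store_records(dst, src, 4)` with a 16-byte-aligned `dst` copies four records and fences once at the end. Unaligned destinations, sizes that are not a multiple of 16, and durability on persistent memory are not handled here; these are the kinds of details the libpmem routines take care of.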

Piotr