Cache line relation to the topology settings

Sajad Karim

Nov 15, 2022, 1:26:30 PM
to pmem

Hi Everyone,

I have two questions regarding the activation/consumption of cache lines. In particular, I want to understand under what circumstances the driver consumes one cache line versus all four.

I will try to be as brief and clear as possible. My questions are:

  1. As per the official product brief at https://www.intel.de/content/www/de/de/products/docs/memory-storage/optane-persistent-memory/optane-dc-persistent-memory-brief.html, the performance of the product depends on how the cache lines are used (64 B vs. 256 B accesses).

    My first question is: is there any way to find out what theoretical results a given DRAM/PMem topology would yield? For instance, what bandwidth, or how many cache lines, would be involved in a 6-DRAM to 4-PMem configuration?

  2. I have written a small program (using libpmem) to verify the numbers for the read and write operations. The program simply writes and reads a 5 GB object sequentially, in block sizes ranging from 64 bytes up to 128 MB. I have attached the code to this post; a simplified sketch of the approach is also shown after the points below.

    The currently configured topology is 6 DRAM to 4 PMem DIMMs. Here is the plot:

[Attached plot: query_to_pmem_group.jpg]

  • As per the above-mentioned link, with 64 B and 256 B cache lines, read bandwidth can reach up to 1.7 and 6.8 GB/s respectively. My read numbers suggest that more than one cache line was consumed, but not all four, because in that case the whole operation (for block sizes > 256 bytes) would have finished in under a second (5 GB / 6.8 GB/s ≈ 0.74 s). Moreover, with a 64 B block size, the read operation should take around 3 seconds (5 GB / 1.7 GB/s ≈ 2.94 s), but it took just 1.76 s, which works out to around 2.84 GB/s and is beyond the limit of a single cache line. Am I missing something here?

  • Now, regarding the write operation: with a 64 B block size I get around 0.55 GB/s, which is a bit off, because with 64 B and 256 B cache lines write bandwidth can reach up to 0.45 and 1.85 GB/s respectively. Also, with a 512 B block size I get around 2.33 GB/s, which is still higher than the official numbers.

    Considering the above numbers, can I assume that more than one cache line is used?

    For example, for read operations with a 4 KB block size, two cache lines appear to be consumed (5 GB / 1.45 s = 3.44 GB/s), since with a single cache line the bandwidth can reach at most 1.7 GB/s.

    Also, for writes with a 512-byte block size, more than five cache lines would have to be used (5 GB / 2.11 s = 2.37 GB/s), since with a single cache line the bandwidth can reach at most 0.45 GB/s. But the current topology has 4 PMem DIMMs to 6 DRAM DIMMs, so I think my assumption is wrong.

    In summary, most of the numbers in the plot do not match the numbers given at the official link. Could anyone please help me demystify this behavior?
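
    For reference, here is a simplified sketch of what the benchmark does (the pool path, sizes, and helper names are illustrative placeholders; the exact code is in the attachment):

        /*
         * Simplified sketch of a sequential libpmem read/write benchmark.
         * Illustrative only: /mnt/pmem/testfile is a placeholder path.
         * Build: cc bench.c -o bench -lpmem
         */
        #include <libpmem.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <time.h>

        #define TOTAL (5ULL * 1024 * 1024 * 1024)   /* 5 GiB object */
        #define MAXBS (128ULL * 1024 * 1024)        /* largest block size */

        static double now_sec(void)
        {
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            return ts.tv_sec + ts.tv_nsec / 1e9;
        }

        int main(void)
        {
            size_t mapped_len;
            int is_pmem;
            char *pmem = pmem_map_file("/mnt/pmem/testfile", TOTAL,
                                       PMEM_FILE_CREATE, 0666,
                                       &mapped_len, &is_pmem);
            if (pmem == NULL) {
                perror("pmem_map_file");
                return 1;
            }

            char *buf = malloc(MAXBS);              /* DRAM staging buffer */
            memset(buf, 0xab, MAXBS);

            for (size_t bs = 64; bs <= MAXBS; bs *= 2) {
                /* sequential write: copy one block at a time and persist it */
                double t0 = now_sec();
                for (size_t off = 0; off + bs <= mapped_len; off += bs)
                    pmem_memcpy_persist(pmem + off, buf, bs);
                double tw = now_sec() - t0;

                /* sequential read: copy each block back into DRAM */
                t0 = now_sec();
                for (size_t off = 0; off + bs <= mapped_len; off += bs)
                    memcpy(buf, pmem + off, bs);
                double tr = now_sec() - t0;

                printf("bs=%10zu  write %.2f GB/s  read %.2f GB/s\n",
                       bs, TOTAL / tw / 1e9, TOTAL / tr / 1e9);
            }

            free(buf);
            pmem_unmap(pmem, mapped_len);
            return 0;
        }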

Thank you for your time.


Regards,
Karim.

[Attachment: code_snippet.txt]

Eduardo Berrocal

Nov 17, 2022, 7:08:59 PM
to pmem
The internal block size of PMem modules is 256 B, in contrast with DRAM, whose "block size" is a cache line (64 B).

So, the short answer to your question is yes: more than one cache line is used (or can be used). To get the best bandwidth on PMem, you have to make sure your I/O operations are aligned so they access 256 B at a time instead of just 64 B, which would cut your bandwidth to a quarter. Given that you are using sequential I/O, you are most likely already doing I/O in 256 B chunks.
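For illustration, here is a minimal sketch of what 256 B-aligned copies could look like with libpmem (the function name and the alignment handling are mine, not taken from your attached code):

    /*
     * Sketch: copy into pmem in 256 B, 256 B-aligned chunks so each
     * store covers one full internal block ("XPLine") instead of a
     * single 64 B cache line. Assumes dst points into a mapping
     * returned by pmem_map_file().
     */
    #include <libpmem.h>
    #include <stdint.h>
    #include <stddef.h>

    #define XPLINE 256

    void copy_xpline_aligned(char *dst, const char *src, size_t len)
    {
        /* head: bring dst up to the next 256 B boundary */
        size_t skew = (uintptr_t)dst & (XPLINE - 1);
        if (skew) {
            size_t head = XPLINE - skew;
            if (head > len)
                head = len;
            pmem_memcpy_nodrain(dst, src, head);
            dst += head; src += head; len -= head;
        }

        /* body: whole 256 B chunks, one internal block per copy */
        while (len >= XPLINE) {
            pmem_memcpy_nodrain(dst, src, XPLINE);
            dst += XPLINE; src += XPLINE; len -= XPLINE;
        }

        /* tail: leftover bytes smaller than one block */
        if (len)
            pmem_memcpy_nodrain(dst, src, len);

        pmem_drain();   /* single fence after all the flushed copies */
    }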

Cheers,