MLC latency vs bandwidth, pmem


Ranjan Sarpangala Venkatesh

Oct 15, 2020, 3:30:35 PM10/15/20
to pm...@googlegroups.com
Hi, 

I collected latency vs bandwidth numbers for read/write and remote/local accesses of Optane in App direct - interleaved mode using MLC v3.9 with --loaded-latency option. 

Two questions
1. The bandwidth numbers are reasonable. However, the latency numbers seem to be those of DRAM.

cat results/loaded-latency/write-local-24thread.csv
latency,bandwidth
65,7.24
62,7.27
62,7.28
62,7.28
63,7.25
62,7.28
63,7.30
62,7.43
62,8.55
62,8.98
62,7.12
62,5.34
62,4.37
62,3.59
62,2.77
62,2.27
62,1.90
62,1.50
63,1.22

cat results/loaded-latency/read-local-24thread.csv
latency,bandwidth
64,31.81
63,31.83
62,31.84
62,31.76
65,31.52
63,31.65
62,31.69
62,31.73
62,31.73
62,31.40
62,25.06
62,18.28
62,14.41
62,11.32
62,8.07
62,6.07
63,4.55
62,2.99
63,1.88
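For reference, the peak bandwidth can be pulled out of CSVs like these with a small awk one-liner (a sketch; `peak_bw` and `sample.csv` are illustrative names, not part of my setup):

```shell
# Hypothetical helper: print the highest bandwidth value (GB/s) from an
# MLC loaded-latency CSV that starts with a "latency,bandwidth" header.
peak_bw() {
    tail -n +2 "$1" | awk -F, '$2 > max { max = $2 } END { print max }'
}

# Small self-contained example using a few rows of the read-local data:
printf 'latency,bandwidth\n64,31.81\n62,31.84\n63,1.88\n' > sample.csv
peak_bw sample.csv    # prints 31.84
```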

2. Is it possible to get the latency thread to do writes instead of reads in MLC?

I looked at the options in the MLC v3.9 manual; the -o input file does not seem to offer a way to configure the latency thread other than its CPU number.

Thanks

Regards
Ranjan

Anton Gavriliuk

Oct 15, 2020, 4:35:18 PM10/15/20
to Ranjan Sarpangala Venkatesh, pmem
Hi Ranjan,

1. The bandwidth numbers are reasonable. However, the latency numbers seem to be those of DRAM.

I don't think mlc measures PMEM in AppDirect mode. You can measure read latencies by generating some read traffic to PMEM and running

./pcm-latency.x -pmm

you will see output like:

PMM read Latency(ns)
Socket0: 0.00
Socket1: 0.00

As for PMEM write latencies, keep in mind that writes go first to the PMEM WPQ (Write Pending Queue) buffers, not directly to the 3DXP media. I drew this conclusion because my PMEM writes are faster than my reads.

Anton


Steve Scargall

Oct 15, 2020, 7:03:01 PM10/15/20
to pmem
On Thursday, October 15, 2020 at 1:30:35 PM UTC-6 ranj...@gmail.com wrote:
1. The bandwidth numbers are reasonable. However, the latency numbers seem to be those of DRAM.

Agreed. See below for how to get PMem latencies. The data was probably being cached in the page cache if you didn't mount the file systems with '-o dax'.
 
2. Is it possible to get the latency thread to do writes instead of reads on MLC?

Yes, it is possible with MLC to measure idle latency, loaded latency, and bandwidth.

For the following, I assume a 2-Socket server configured in AppDirect where each Region has a single FSDAX namespace mounted with the DAX option:

--- Setup ---
$ sudo ipmctl create -goal PersistentMemoryType=AppDirect
$ sudo systemctl reboot
$ sudo ndctl create-namespace --continue
$ sudo mkfs.ext4 /dev/pmem0
$ sudo mkfs.ext4 /dev/pmem1
$ sudo mkdir /pmemfs0 /pmemfs1
$ sudo mount -o dax /dev/pmem0 /pmemfs0
$ sudo mount -o dax /dev/pmem1 /pmemfs1

This system has the following NUMA layout

# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 0 size: 192114 MB
node 0 free: 187552 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 1 size: 192012 MB
node 1 free: 191380 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10


--- Sequential Idle Latency Tests (Local NUMA) ---
// Use the first CPU on Socket 0/1 to perform the test to the local socket PMem
# mlc --idle_latency -c0 -J/pmemfs0
# mlc --idle_latency -c24 -J/pmemfs1

--- Sequential Idle Latency Tests (Remote NUMA) ---
// Use the first CPU on Socket 0/1 to perform the test to the remote socket PMem
# mlc --idle_latency -c24 -J/pmemfs0
# mlc --idle_latency -c0 -J/pmemfs1

--- Random Idle Latency Tests (Local NUMA) ---
// Use the first CPU on Socket 0/1 to perform the test to the local socket PMem
# mlc --idle_latency -l256 -c0 -J/pmemfs0
# mlc --idle_latency -l256 -c24 -J/pmemfs1

--- Random Idle Latency Tests (Remote NUMA) ---
// Use the first CPU on Socket 0/1 to perform the test to the remote socket PMem
# mlc --idle_latency -l256 -c24 -J/pmemfs0
# mlc --idle_latency -l256 -c0 -J/pmemfs1

--- Sequential Read Loaded Latency Tests ---
// Create a file containing delays between memory accesses, thus simulating an app doing compute work
// These numbers are the defaults used by MLC loaded latency. Feel free to use your own.
# cat <<EOF > loaded_latency_delays
00000
00002
00015
00050
00100
00200
00300
00400
00500
00700
01000
01300
01700
02500
03500
05000
09000
20000
EOF

// Create a Per Thread test definition file as input to MLC. 
// Here, we create a single thread test and another test using all threads on Socket0
# echo "0  R seq  100000 pmem /pmemfs0" > PMem_PERTHREAD
# echo "0-23,48-71 R seq  100000 pmem /pmemfs0" >> PMem_PERTHREAD

// Run the MLC test using the delay and per-thread input files. Each test runs for 5 seconds
# mlc --loaded_latency -g./loaded_latency_delays -o./PMem_PERTHREAD -t5 > llat_seq_READ.txt


--- Random Read Loaded Latency Tests ---
// Use the same loaded_latency_delays

// Create a Per Thread test definition file as input to MLC. 
// Here, we create a single thread test and another test using all threads on Socket0
# echo "0  R rand  100000 pmem /pmemfs0" > PMem_PERTHREAD
# echo "0-23,48-71 R rand  100000 pmem /pmemfs0" >> PMem_PERTHREAD

// Run the MLC test using the delay and per-thread input files. Each test runs for 5 seconds
# mlc --loaded_latency -g./loaded_latency_delays -o./PMem_PERTHREAD -t5 > llat_rand_READ.txt

--- Max Bandwidth Tests ---
MLC has a wide variety of read/write tests built in. See the '-W' option:

  -Wn where n means
        2  - 2:1 read-write ratio
        3  - 3:1 read-write ratio
        4  - 3:2 read-write ratio
        5  - 1:1 read-write ratio
        6  - 0:1 read-Non Temporal Write ratio
        7  - 2:1 read-Non Temporal Write ratio
        8  - 1:1 read-Non Temporal Write ratio
        9  - 3:1 read-Non Temporal Write ratio
        10 - 2:1 read-Non Temporal Write ratio (stream triad-like)
                 Same as -W7 but the 2 reads are from 2 different buffers
        11 - 3:1 read-Write ratio (stream triad-like with RFO)
                 Same as -W3 but the 2 reads are from 2 different buffers
        12 - 4:1 read-Write ratio
        21 - 100% read with 2 addr streams - valid with only -o option
        23 - 3:1 read-write ratio with 2 addr streams - valid with only -o option
        27 - 2:1 read-NT write with 2 addr streams - valid with only -o option

To use this, we need to create an input file, just like we did for the loaded latency tests. 

// Create a workload file
# cat <<EOF > PMem_PERTHREAD
#CPUs         Traffic type   seq or rand  buffer size   pmem or dram   pmem path          
0-23,48-71    W2             rand         100000        pmem           /pmemfs0      
EOF

// Run the test
# mlc --loaded_latency -d0 -o./PMem_PERTHREAD -t10 -T -Z

// Using the -w options, you can run additional tests (one test per PMem_PERTHREAD file), such as:
#CPUs         Traffic type   seq or rand  buffer size   pmem or dram   pmem path     
0-23,48-71    R              seq          100000        pmem           /pmemfs0      
0-23,48-71    R              rand         100000        pmem           /pmemfs0 
0-23,48-71    W2             seq          100000        pmem           /pmemfs0      
0-23,48-71    W2             rand         100000        pmem           /pmemfs0 
0-23,48-71    W5             seq          100000        pmem           /pmemfs0      
0-23,48-71    W5             rand         100000        pmem           /pmemfs0     
0-23,48-71    W6             seq          100000        pmem           /pmemfs0      
0-23,48-71    W6             rand         100000        pmem           /pmemfs0      
0-23,48-71    W7             seq          100000        pmem           /pmemfs0      
0-23,48-71    W7             rand         100000        pmem           /pmemfs0      
     

[...]
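Rather than editing one PMem_PERTHREAD file by hand for each row above, the whole matrix can be generated in a loop (a sketch; the per-workload file names are made up, and the mlc invocation is commented out so the loop stands alone):

```shell
# Generate one per-thread input file per traffic type and access pattern,
# matching the table above, then (optionally) run MLC on each.
for wl in R W2 W5 W6 W7; do
    for pattern in seq rand; do
        f="PMem_PERTHREAD_${wl}_${pattern}"
        printf '0-23,48-71 %s %s 100000 pmem /pmemfs0\n' "$wl" "$pattern" > "$f"
        # mlc --loaded_latency -d0 -o"./$f" -t10 -T -Z > "bw_${wl}_${pattern}.txt"
    done
done
```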

--- Ramp Up Bandwidth Tests ---
If you take the same approach as the 'Max Bandwidth' testing, you can achieve ramp-up testing by changing the number of threads that MLC uses, e.g.:

#CPUs         Traffic type   seq or rand  buffer size   pmem or dram   pmem path     
0             R              seq          100000        pmem           /pmemfs0  
0-1           R              seq          100000        pmem           /pmemfs0  
0-3           R              seq          100000        pmem           /pmemfs0  
0-7           R              seq          100000        pmem           /pmemfs0  
0-15          R              seq          100000        pmem           /pmemfs0  
[...]
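The ramp can also be scripted in the same way (a sketch; the `ramp_*` file names are made up, and the mlc call is commented out so the loop stands alone):

```shell
# Emit one per-thread input file per CPU-count step of the ramp.
for cpus in 0 0-1 0-3 0-7 0-15 0-23; do
    printf '%s R seq 100000 pmem /pmemfs0\n' "$cpus" > "ramp_${cpus}"
    # mlc --loaded_latency -d0 -o"./ramp_${cpus}" -t5 > "ramp_${cpus}.txt"
done
```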

--- Summary ---
The above approaches can be extended to achieve a multitude of tests including, but not limited to:
  • HyperThreading tests (if Hyperthreading is enabled, run the tests on the first CPU Thread only using -X)
  • Remote PMem testing. ie: Run MLC on Socket0 while accessing the PMem on Socket1. This is useful to observe the UPI and Directory overheads, then you can experiment with the BIOS options such as Snoopy Mode, QoS Recipes, etc to achieve better performance for a given workload. 
  • The workload definition files allow you to perform either DRAM or PMem, or Both in a single test. 
  • Use '-P' (CLFLUSH used to evict stores to persistent memory) to perform flushing and observe its effect on latencies and bandwidths
  • Instead of the loaded_latency_delays, use '-r' to generate random latencies
  • Lots more. MLC has a ton of options. There's lots of fun to be had. 
/Steve