MLC latency vs bandwidth, pmem


Ranjan Sarpangala Venkatesh

Oct 15, 2020, 3:30:35 PM10/15/20
to pm...@googlegroups.com
Hi, 

I collected latency vs bandwidth numbers for read/write and remote/local accesses of Optane in App direct - interleaved mode using MLC v3.9 with --loaded-latency option. 

Two questions
1. The bandwidth numbers are reasonable. However, the latency numbers seem to be those of DRAM.

cat results/loaded-latency/write-local-24thread.csv
latency,bandwidth
65,7.24
62,7.27
62,7.28
62,7.28
63,7.25
62,7.28
63,7.30
62,7.43
62,8.55
62,8.98
62,7.12
62,5.34
62,4.37
62,3.59
62,2.77
62,2.27
62,1.90
62,1.50
63,1.22

cat results/loaded-latency/read-local-24thread.csv
latency,bandwidth
64,31.81
63,31.83
62,31.84
62,31.76
65,31.52
63,31.65
62,31.69
62,31.73
62,31.73
62,31.40
62,25.06
62,18.28
62,14.41
62,11.32
62,8.07
62,6.07
63,4.55
62,2.99
63,1.88
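For reference, the peak bandwidth can be pulled out of CSVs like these with a small awk one-liner (a sketch; `peak_bw` and `sample.csv` are illustrative names, not part of my setup):

```shell
# Hypothetical helper: print the highest bandwidth value (GB/s) from an
# MLC loaded-latency CSV that starts with a "latency,bandwidth" header.
peak_bw() {
    tail -n +2 "$1" | awk -F, '$2 > max { max = $2 } END { print max }'
}

# Small self-contained example using a few rows of the read-local data:
printf 'latency,bandwidth\n64,31.81\n62,31.84\n63,1.88\n' > sample.csv
peak_bw sample.csv    # prints 31.84
```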

2. Is it possible to get the latency thread to do writes instead of reads in MLC?

I looked at the options in the MLC v3.9 manual; the -o input file does not seem to offer a way to configure the latency thread other than its CPU number.

Thanks

Regards
Ranjan

Anton Gavriliuk

Oct 15, 2020, 4:35:18 PM10/15/20
to Ranjan Sarpangala Venkatesh, pmem
Hi Ranjan,

1. The bandwidth numbers are reasonable. However, the latency numbers seem to be those of DRAM.

I don't think mlc measures PMEM in AppDirect mode. You can measure read latencies by generating some read traffic to PMEM and running

./pcm-latency.x -pmm

you will see output like:

PMM read Latency(ns)
Socket0: 0.00
Socket1: 0.00

As for PMEM write latencies, keep in mind that writes go first to the PMEM WPQ (Write Pending Queue) buffers, not directly to the 3DXP media. I drew this conclusion because my PMEM writes are faster than my reads.

Anton


Steve Scargall

Oct 15, 2020, 7:03:01 PM10/15/20
to pmem
On Thursday, October 15, 2020 at 1:30:35 PM UTC-6 ranj...@gmail.com wrote:
1. The bandwidth numbers are reasonable. However, the latency numbers seem to be those of DRAM.

Agreed. See below for how to get PMem latencies. The data was probably being cached in the page cache if you didn't mount the file systems with '-o dax'.
 
2. Is it possible to get the latency thread to do writes instead of reads on MLC?

Yes, it is possible with MLC to measure idle latency, loaded latency, and bandwidth.

For the following, I assume a 2-Socket server configured in AppDirect where each Region has a single FSDAX namespace mounted with the DAX option:

--- Setup ---
$ sudo ipmctl create -goal PersistentMemoryType=AppDirect
$ sudo systemctl reboot
$ sudo ndctl create-namespace --continue
$ sudo mkfs.ext4 /dev/pmem0
$ sudo mkfs.ext4 /dev/pmem1
$ sudo mkdir /pmemfs0 /pmemfs1
$ sudo mount -o dax /dev/pmem0 /pmemfs0
$ sudo mount -o dax /dev/pmem1 /pmemfs1

This system has the following NUMA layout

# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 0 size: 192114 MB
node 0 free: 187552 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 1 size: 192012 MB
node 1 free: 191380 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10


--- Sequential Idle Latency Tests (Local NUMA) ---
// Use the first CPU on Socket 0/1 to perform the test to the local socket PMem
# mlc --idle_latency -c0 -J/pmemfs0
# mlc --idle_latency -c24 -J/pmemfs1

--- Sequential Idle Latency Tests (Remote NUMA) ---
// Use the first CPU on Socket 0/1 to perform the test to the remote socket PMem
# mlc --idle_latency -c24 -J/pmemfs0
# mlc --idle_latency -c0 -J/pmemfs1

--- Random Idle Latency Tests (Local NUMA) ---
// Use the first CPU on Socket 0/1 to perform the test to the local socket PMem
# mlc --idle_latency -l256 -c0 -J/pmemfs0
# mlc --idle_latency -l256 -c24 -J/pmemfs1

--- Random Idle Latency Tests (Remote NUMA) ---
// Use the first CPU on Socket 0/1 to perform the test to the remote socket PMem
# mlc --idle_latency -l256 -c24 -J/pmemfs0
# mlc --idle_latency -l256 -c0 -J/pmemfs1

--- Sequential Read Loaded Latency Tests ---
// Create a file containing delays between memory accesses, thus simulating an app doing compute work
// These numbers are the defaults used by MLC loaded latency. Feel free to use your own.
# cat <<EOF > loaded_latency_delays
00000
00002
00015
00050
00100
00200
00300
00400
00500
00700
01000
01300
01700
02500
03500
05000
09000
20000
EOF

// Create a Per Thread test definition file as input to MLC. 
// Here, we create a single thread test and another test using all threads on Socket0
# echo "0  R seq  100000 pmem /pmemfs0" > PMem_PERTHREAD
# echo "0-23,48-71 R seq  100000 pmem /pmemfs0" >> PMem_PERTHREAD

// Run the MLC test using the delay and per-thread input files. Each test runs for 5 seconds
# mlc --loaded_latency -g./loaded_latency_delays -o./PMem_PERTHREAD -t5 > llat_seq_READ.txt


--- Random Read Loaded Latency Tests ---
// Use the same loaded_latency_delays

// Create a Per Thread test definition file as input to MLC. 
// Here, we create a single thread test and another test using all threads on Socket0
# echo "0  R rand  100000 pmem /pmemfs0" > PMem_PERTHREAD
# echo "0-23,48-71 R rand  100000 pmem /pmemfs0" >> PMem_PERTHREAD

// Run the MLC test using the delay and per-thread input files. Each test runs for 5 seconds
# mlc --loaded_latency -g./loaded_latency_delays -o./PMem_PERTHREAD -t5 > llat_rand_READ.txt

--- Max Bandwidth Tests ---
MLC has a wide variety of read/write tests built in. See the '-W' option:

  -Wn where n means
        2  - 2:1 read-write ratio
        3  - 3:1 read-write ratio
        4  - 3:2 read-write ratio
        5  - 1:1 read-write ratio
        6  - 0:1 read-Non Temporal Write ratio
        7  - 2:1 read-Non Temporal Write ratio
        8  - 1:1 read-Non Temporal Write ratio
        9  - 3:1 read-Non Temporal Write ratio
        10 - 2:1 read-Non Temporal Write ratio (stream triad-like)
                 Same as -W7 but the 2 reads are from 2 different buffers
        11 - 3:1 read-Write ratio (stream triad-like with RFO)
                 Same as -W3 but the 2 reads are from 2 different buffers
        12 - 4:1 read-Write ratio
        21 - 100% read with 2 addr streams - valid with only -o option
        23 - 3:1 read-write ratio with 2 addr streams - valid with only -o option
        27 - 2:1 read-NT write with 2 addr streams - valid with only -o option

To use this, we need to create an input file, just like we did for the loaded latency tests. 

// Create a workload file
# cat <<EOF > PMem_PERTHREAD
#CPUs         Traffic type   seq or rand  buffer size   pmem or dram   pmem path          
0-23,48-71    W2             rand         100000        pmem           /pmemfs0      
EOF

// Run the test
# mlc --loaded_latency -d0 -o./PMem_PERTHREAD -t10 -T -Z

// Using the -w options, you can run additional tests (one test per PMem_PERTHREAD file), such as:
#CPUs         Traffic type   seq or rand  buffer size   pmem or dram   pmem path     
0-23,48-71    R              seq          100000        pmem           /pmemfs0      
0-23,48-71    R              rand         100000        pmem           /pmemfs0 
0-23,48-71    W2             seq          100000        pmem           /pmemfs0      
0-23,48-71    W2             rand         100000        pmem           /pmemfs0 
0-23,48-71    W5             seq          100000        pmem           /pmemfs0      
0-23,48-71    W5             rand         100000        pmem           /pmemfs0     
0-23,48-71    W6             seq          100000        pmem           /pmemfs0      
0-23,48-71    W6             rand         100000        pmem           /pmemfs0      
0-23,48-71    W7             seq          100000        pmem           /pmemfs0      
0-23,48-71    W7             rand         100000        pmem           /pmemfs0      
     

[...]
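Rather than editing one PMem_PERTHREAD file by hand for each row above, the whole matrix can be generated in a loop (a sketch; the per-workload file names are made up, and the mlc invocation is commented out so the loop stands alone):

```shell
# Generate one per-thread input file per traffic type and access pattern,
# matching the table above, then (optionally) run MLC on each.
for wl in R W2 W5 W6 W7; do
    for pattern in seq rand; do
        f="PMem_PERTHREAD_${wl}_${pattern}"
        printf '0-23,48-71 %s %s 100000 pmem /pmemfs0\n' "$wl" "$pattern" > "$f"
        # mlc --loaded_latency -d0 -o"./$f" -t10 -T -Z > "bw_${wl}_${pattern}.txt"
    done
done
```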

--- Ramp Up Bandwidth Tests ---
If you take the same approach as the 'Max Bandwidth' testing, you can achieve ramp-up testing by changing the number of threads that MLC uses, e.g.:

#CPUs         Traffic type   seq or rand  buffer size   pmem or dram   pmem path     
0             R              seq          100000        pmem           /pmemfs0  
0-1           R              seq          100000        pmem           /pmemfs0  
0-3           R              seq          100000        pmem           /pmemfs0  
0-7           R              seq          100000        pmem           /pmemfs0  
0-15          R              seq          100000        pmem           /pmemfs0  
[...]
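The ramp can also be scripted in the same way (a sketch; the `ramp_*` file names are made up, and the mlc call is commented out so the loop stands alone):

```shell
# Emit one per-thread input file per CPU-count step of the ramp.
for cpus in 0 0-1 0-3 0-7 0-15 0-23; do
    printf '%s R seq 100000 pmem /pmemfs0\n' "$cpus" > "ramp_${cpus}"
    # mlc --loaded_latency -d0 -o"./ramp_${cpus}" -t5 > "ramp_${cpus}.txt"
done
```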

--- Summary ---
The above approaches can be extended to achieve a multitude of tests including, but not limited to:
  • HyperThreading tests (if Hyperthreading is enabled, run the tests on the first CPU Thread only using -X)
  • Remote PMem testing. ie: Run MLC on Socket0 while accessing the PMem on Socket1. This is useful to observe the UPI and Directory overheads, then you can experiment with the BIOS options such as Snoopy Mode, QoS Recipes, etc to achieve better performance for a given workload. 
  • The workload definition files allow you to perform either DRAM or PMem, or Both in a single test. 
  • Use '-P' (CLFLUSH used to evict stores to persistent memory) to perform flushing and observe its effect on latencies and bandwidths
  • Instead of the loaded_latency_delays, use '-r' to generate random latencies
  • Lots more. MLC has a ton of options. There's lots of fun to be had. 
/Steve