How to benchmark the random read bandwidth of a PM device.


李鹏

May 23, 2019, 10:52:06 AM
to pmem
To anyone who can help:

   We have installed the PM devices and use ext4-dax to access them.
   1. We use FIO with the libpmem engine to randomly read a file on the ext4-dax file system. The pcm tools show that the pmem read bandwidth is about 4.9 GB/s.
   2. However, we wrote another simple test demo that mmaps the pmem file and starts several threads issuing random read requests, and it shows only about 2.9 GB/s for one PM device.

We want to know why we cannot reach 4.9 GB/s for random read requests, as FIO does. The read size is 4 KB per read request.
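For reference, an FIO invocation of roughly this shape matches the setup described above; the mount point, file size, and runtime here are placeholders rather than the exact job we ran:

fio --name=pmem-randread --ioengine=libpmem --directory=/mnt/dax \
    --rw=randread --bs=4k --size=8g --iodepth=1 --direct=1 --thread \
    --numjobs=8 --time_based --runtime=60 --group_reporting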

-------------------------------------------------------------------------------------------
Our demo code is listed below:
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#ifndef _WIN32
#include <unistd.h>
#else
#include <io.h>
#endif
#include <string.h>
#include <libpmem.h>
#include <pthread.h>

#define BUF_LEN 4096
#define MAX_THREADS 100

char buf[MAX_THREADS][BUF_LEN];   /* one destination buffer per thread */
char *pmemaddr;                   /* base address of the mapped pmem file */
size_t mapped_len;                /* length of the mapping */

static void* fn(void *arg)
{
    int i = *(int *)arg;                /* thread index */
    size_t offset = 0;
    long total = 1024 * 1024 * 100;     /* 100M random 4 KB reads per thread */

    printf("I am in thread %d!\n", i);

    for (long m = 0; m < total; ++m) {
        /* pick a random offset; rand() returns at most RAND_MAX, so a
           mapping larger than that is only partially covered */
        offset = (size_t)rand() % (mapped_len - BUF_LEN);
        memcpy(buf[i], pmemaddr + offset, BUF_LEN);
    }

    printf("Thread %d finished!\n", i);
    return NULL;
}

int
main(int argc, char *argv[])
{
    int is_pmem;
    pthread_t threads[MAX_THREADS];
    int thread_ids[MAX_THREADS];
    int NUM_THREADS = 1;

    if (argc < 2) {
        fprintf(stderr, "usage: %s file num_threads\n", argv[0]);
        exit(1);
    }
    if (argc >= 3) {
        NUM_THREADS = atoi(argv[2]);
        if (NUM_THREADS < 1 || NUM_THREADS > MAX_THREADS) {
            fprintf(stderr, "num_threads must be between 1 and %d\n", MAX_THREADS);
            exit(1);
        }
    }

    printf("Begin to open file: %s with %d threads\n", argv[1], NUM_THREADS);

    /* memory map an existing pmem file */
    if ((pmemaddr = pmem_map_file(argv[1], 0, 0,
            0666, &mapped_len, &is_pmem)) == NULL) {
        perror("pmem_map_file");
        exit(1);
    }
    printf("map file success, file: %s, size: %zu, is_pmem: %d\n",
            argv[1], mapped_len, is_pmem);

    for (int i = 0; i < NUM_THREADS; ++i) {
        /* pass a stable per-thread index; passing &i directly would race
           with the loop variable being incremented */
        thread_ids[i] = i;
        pthread_create(threads + i, NULL, fn, &thread_ids[i]);
    }

    printf("Waiting for the threads to finish...\n");
    for (int i = 0; i < NUM_THREADS; ++i) {
        pthread_join(threads[i], NULL);
    }
    printf("All Finished!\n");
    return 0;
}
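A build and run line of roughly this shape works for the demo above (the source file name, mount point, and thread count are placeholders):

gcc -O2 -o readbench readbench.c -lpmem -lpthread
./readbench /mnt/dax/testfile 8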


Anton Gavriliuk

May 23, 2019, 11:11:46 AM
to 李鹏, pmem
Hi 

> 1. We use FIO with the libpmem engine to randomly read a file on the ext4-dax file system. The pcm tools show that the pmem read bandwidth is about 4.9 GB/s.

How many FIO jobs ("--numjobs") did you use?

For --numjobs=1 I'm able to get 3 MB/s.

With a higher number of FIO threads in the test, it may not be real pmem performance/traffic due to high CPU L2/L3 cache hits.

Anton 


Anton Gavriliuk

May 23, 2019, 12:41:12 PM
to 李鹏, pmem
>  For --numjobs=1 I'm able to get 3 MB/s.

Sorry, 3 GB/s.

Anton


李鹏

May 24, 2019, 3:54:26 AM
to pmem
We set the FIO threads to 8.

Now we have another update.
I modified the above code so that it starts N threads, where each thread opens a separate PM file; for example, thread t1 opens file1 and t2 opens file2. With that change we reach 5 GB/s for one PM device.

However, if we use only one PM file shared by all the threads, the throughput is 3.1 GB/s.

We also tried another way: mmap one big file and split the address space into N chunks, where each thread reads data only from its own chunk. The throughput is also 5 GB/s.
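The per-chunk variant changes only the offset computation in the thread function; a minimal sketch (assuming NUM_THREADS is made visible to the threads, and using illustrative names) looks like this:

/* thread i only reads inside its own slice of the mapping */
size_t chunk = mapped_len / NUM_THREADS;          /* bytes per thread */
size_t base  = (size_t)i * chunk;                 /* start of thread i's slice */
offset = base + (size_t)rand() % (chunk - BUF_LEN);
memcpy(buf[i], pmemaddr + offset, BUF_LEN);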

The above code also has a severe problem owing to its use of rand(): rand() takes a lock when called from multiple threads. We have created a lock-free random function as follows:

// gcc -shared -fPIC intelrand.c -O3 -o intelrand.so -mrdrnd

#include <immintrin.h>

long int randompersonal(void)
{
    unsigned long long rnd;

    /* RDRAND may transiently fail; retry until it reports success */
    while (__builtin_ia32_rdrand64_step(&rnd) == 0)
        ;

    /* mask to a non-negative 31-bit value, matching rand()'s range */
    return (long int)(rnd & 0x7FFFFFFF);
}

This code comes from my partner, liuxian. 
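Another lock-free option, given here only as a sketch and not the code used in the tests above, is a small per-thread PRNG such as xorshift64; it avoids both the rand() lock and the latency of the RDRAND instruction:

#include <stdint.h>

/* Marsaglia xorshift64; state must be non-zero and is private to each thread */
static inline uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* inside fn(), seed once per thread, e.g.:
 *   uint64_t seed = 0x9E3779B97F4A7C15ULL ^ (uint64_t)(i + 1);
 * then:
 *   offset = xorshift64(&seed) % (mapped_len - BUF_LEN);
 */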





李鹏

May 24, 2019, 3:59:07 AM
to pmem
We have 4 PMs, interleaved into one region.

If we set numjobs=1, we get 2.4 GB/s, where each PM gives about 600 MB/s.
Block size is 4096, randread, iodepth=1, libpmem engine.



Piotr Balcer

May 24, 2019, 4:03:12 AM
to pmem
Try using device dax for your measurements to exclude the overheads of the file system from the benchmark. You could also touch every single page ahead of the benchmark, so that there are no page faults when you measure the throughput.
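Such a warm-up pass could look roughly like the sketch below; it assumes the pmemaddr/mapped_len variables from the demo earlier in the thread and a 4 KB page size, and reads one byte per page (for a freshly created, never-written file a write per page may be needed instead, as the later reply explains):

#include <stddef.h>

/* Touch every 4 KB page once so the measured phase takes no page faults. */
static void prefault(const char *addr, size_t len)
{
    volatile char sink = 0;

    for (size_t off = 0; off < len; off += 4096)
        sink += addr[off];
    (void)sink;
}

In the demo it would be called as prefault(pmemaddr, mapped_len) right after pmem_map_file() succeeds.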

李鹏

May 24, 2019, 4:39:58 AM
to pmem
We changed it to device-dax and, yes, it reaches 5 GB/s for one PM device.
I do not know why. In my understanding, we mmap a file (a file on ext4-dax) and memcpy the data from the mapped address. Does that still carry a file system cost?




Piotr Balcer

May 24, 2019, 4:48:17 AM
to pmem
The only cost of the file system is the initial page faults that have to be taken on the memory-mapped file. Once they are done, there's no overhead.
In other words, after you write at least once to every page, there's no more overhead from the FS.
Try running your benchmark (but reduce the total to something smaller) under perf, and you will see that a significant portion of the time is spent in the kernel in the page_fault() function.
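A minimal way to do that, assuming the demo is compiled to a binary called readbench (the binary name and paths are placeholders):

perf record -g ./readbench /mnt/dax/testfile 4
perf report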

Piotr

Anton Gavriliuk

May 24, 2019, 6:15:50 AM
to Piotr Balcer, pmem
> Try running your benchmark (but reduce the total to something smaller) under perf, and you will see that a significant portion of the time is spent in the kernel in the page_fault() function.

If you use fsdax mode and plan to use libpmem/mmap(), the ext4/xfs file system must be created in a non-default way to avoid page faults:

ext4

mkfs.ext4 -b 4096 -E stride=512 -F /dev/pmem0
mount -o dax /dev/pmem0 /mnt/dax

xfs

mkfs.xfs -f -d su=2m,sw=1 /dev/pmem0
mount -o dax /dev/pmem0 /mnt/dax
xfs_io -c "extsize 2m" /mnt/dax

Anton
