optimal config for NVMe SSDs for BeeGFS metadata


Harry Mangalam

Mar 9, 2017, 9:16:00 AM
to beegfs-user
Hi all,

We're planning a new BeeGFS installation with buddy-mirrored metadata servers and were hoping we could use NVMe devices in RAID1 for the metadata.
Is there a best-practices approach to this? RAID1 isn't particularly compute-heavy, so unless the controller is doing lots of other fancy things, it may not be worthwhile to use a hardware approach (and in fact the only NVMe RAID controller we've seen is brand new, and we're not keen on running that close to the cutting edge).

The alternatives are to:
- use mdadm to put two NVMe drives into RAID1 and then put ext4 on top, as we would normally (rough sketch below), or
- use ZFS in the same config.
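
For concreteness, the mdadm/ext4 route would look roughly like this; device names, mount point, and mkfs options are placeholders rather than a tested recipe:

# Assumed device names; mirror two NVMe drives with mdadm.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
# Many small inodes and a 512-byte inode size so xattr-based metadata fits inline;
# values are illustrative only.
mkfs.ext4 -i 2048 -I 512 -O dir_index,filetype /dev/md0
mount -o noatime,nodiratime /dev/md0 /data/beegfs/meta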

However, I haven't seen any information on how ZFS performs as a metadata filesystem for BeeGFS, and I'm concerned that it won't come close to the performance of mdadm/ext4.

Since we'll be using buddy-mirrored metadata servers (in different buildings), each with RAID1, the survivability of the system isn't too much of a worry, but the performance is: this will be a very heavily used filesystem.

We have few concerns about ZFS performance on the storage targets (we're using that config now); the open question is its suitability for metadata.

Thanks in advance for any insights on this.

Harry

John Hanks

Mar 9, 2017, 10:08:43 AM
to beegfs-user

We have an NVMe box with 12 drives running ZFS with no RAID as a shared NFS scratch space. We never tested beyond that first setup because it was already so much faster than QDR IB that it didn't seem necessary. However, I recently added drives and will be trying some assorted RAID types, so I would also be interested in hearing about anyone else's experiences and NVMe config hints in general. FWIW, we are running gzip-9 compression on the scratch space and are still generally network-bound for our random workloads. NVMe is powerful magic.
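
For anyone curious, that compression is a one-line ZFS property; the pool/dataset name here is a made-up placeholder:

# Hypothetical dataset name; gzip-9 gives maximum compression at the cost of CPU.
zfs set compression=gzip-9 tank/scratch
# Check the setting and the ratio actually achieved:
zfs get compression,compressratio tank/scratch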

jbh


--
‘[A] talent for following the ways of yesterday, is not sufficient to improve the world of today.’
 - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC

Sven Breuner

Mar 12, 2017, 3:53:26 PM
to fhgfs...@googlegroups.com, Harry Mangalam
Hi Harry,

you might want to check out this post from Trey Dockendorf where he makes a
recommendation ("set xattr=sa on the zfs filesystem storing metadata") and also
mentions how he benchmarked single-process metadata performance:

https://groups.google.com/d/msg/fhgfs-user/Dvy8SMD56n8/TfO6zFs57zUJ

To test multi-client performance, you might want to look at this whitepaper:
http://www.cloud.fraunhofer.de/content/dam/allianzcloud/de/documents/FhGFS%20-%20Performance%20at%20the%20maximum.pdf
It is already several years old, but at the end you can see examples of how to
test metadata performance with "mdtest". (I would also recommend adding "-r" to
the mdtest arguments so that mdtest removes the files in parallel as well.)
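
For example, a run in the spirit of the whitepaper might look roughly like this; the rank count, path, and tree parameters are placeholders:

# Hypothetical invocation: 256 MPI ranks, 4 items per directory, tree depth 5,
# branching factor 4, 3 iterations; -r added as suggested above so the created
# files are also removed in parallel.
mpirun -np 256 mdtest -I 4 -z 5 -b 4 -i 3 -r -d /mnt/beegfs/mdtest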

I assume you don't also have NVMe drives as storage targets, do you? If you do,
you might want to check out this presentation by Scalable Informatics' Joe
Landman from the BeeGFS User Meeting last year. He was actually able to fully
saturate a 100Gbit link with two ZFS RAIDz2 pools of 12 NVMe drives each. (This
part starts at slide 33.)
http://files.scalableinformatics.com/private/BeeGFS/UM2016/Extreme_performance_storage_v4.pdf
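
For orientation, creating one such pool might look roughly like this; the pool and device names are assumptions, and the exact vdev layout in the slides may differ:

# Hypothetical sketch: one raidz2 vdev across 12 NVMe namespaces
# (bash brace expansion; adjust device names for your system).
zpool create nvmepool raidz2 /dev/nvme{0..11}n1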

Best regards,
Sven

Jens Dreger

Mar 12, 2017, 4:45:25 PM
to fhgfs...@googlegroups.com
Hi Harry,

I have recently been running mdtest with ext4 and ZFS as the storage
backend for buddy-mirrored meta servers. I started out with ZFS on
SATA SSDs but ran into severe stability problems, so I switched to ext4
and replaced the SATA SSDs with RAM block devices, since I wanted to
rule out any compatibility issues between controllers and SSD drives.
Also, the amount of testing would have worn out my SSDs by now: I have
mdtest running in a loop indefinitely, since the setup is still not
entirely stable (one beegfs-meta process crashed overnight; I don't know
why yet).

Here are mdtest results for ext4:

Command line used: mdtest -I4 -z5 -b4 -d /bee/z10/scratch-mirror/mdtest -i3
Path: /bee/z10/scratch-mirror
FS: 434.3 GiB Used FS: 0.0% Inodes: 0.0 Mi Used Inodes: -nan%

256 tasks, 1397760 files/directories

SUMMARY: (of 3 iterations)
   Operation                  Max          Min         Mean     Std Dev
   ---------                  ---          ---         ----     -------
   Directory creation:    8729.058     8591.694     8679.750      62.414
   Directory stat:        95917.047    93721.914    94921.107     907.611
   Directory removal:     9088.270     8846.302     8952.054      101.104
   File creation:         41125.743    40297.799    40830.648     377.507
   File stat:             89228.039    83994.154    87107.150    2248.940
   File read:             52829.965    51182.941    51797.619     734.396
   File removal:          49413.190    40836.537    44411.416    3643.878
   Tree creation:           569.554      521.108      546.375      19.833
   Tree removal:            154.461      147.239      151.231       2.997

and here are the numbers for ZFS on the same meta servers (everything else identical):

SUMMARY: (of 3 iterations)
   Operation                  Max          Min         Mean     Std Dev
   ---------                  ---          ---         ----     -------
   Directory creation:    4780.245     4400.118     4643.899     172.783
   Directory stat:        77724.168    61673.411    68871.645    6656.285
   Directory removal:     4366.347     3907.009     4079.198     204.385
   File creation:         16268.990    15520.098    15915.818     307.211
   File stat:             82472.745    56643.299    68033.440   10762.996
   File read:             28045.776    26163.479    27002.837     781.812
   File removal:          17513.640    13987.945    16170.966    1557.191
   Tree creation:           494.419      449.561      475.191      18.865
   Tree removal:            163.560      116.279      146.617      21.501

ZFS has xattr=sa set. I've seen this performance ratio between ext4
and ZFS in other setups, too. The reason I'd still prefer ZFS over
ext4 is that I want to use snapshots for backup.
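
A minimal sketch of that snapshot-for-backup idea, with placeholder dataset and host names:

# Take a point-in-time snapshot of the metadata dataset...
zfs snapshot metapool/beegfs-meta@backup-2017-03-12
# ...and replicate it to another host with send/receive.
zfs send metapool/beegfs-meta@backup-2017-03-12 | ssh backuphost zfs receive backuppool/beegfs-meta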

The test setup right now consists of 4 meta servers with buddy
mirroring activated, so there are effectively only two. I'm planning to
increase this number soon. mdtest is running on 32 clients with 8
cores each, so 256 processes altogether. The interconnect is Mellanox QDR
IB. Nothing else is running on these machines. Eventually I also plan
to use NVMe SSDs, once everything else works fine ;)

I was going to ask a question about this myself: while the metadata-mirrored
directory was automatically repaired after the crash, another
non-mirrored directory was not. While this is to be expected, I now
have a "half-existing" directory entry:

dreger@z001:~> ls /bee/z10
ls: cannot access /bee/z10/scratch: No such file or directory
scratch scratch-mirror

Since I have had this a few times before on BeeGFS: is there a way to
remove such broken directory entries?

Regards,

Jens.


--
Jens Dreger Freie Universitaet Berlin
dre...@physik.fu-berlin.de Fachbereich Physik - ZEDV
Tel: +49 30 83854774 Arnimallee 14
Fax: +49 30 838454774 14195 Berlin

harry mangalam

Mar 13, 2017, 11:51:33 AM
to Sven Breuner, fhgfs...@googlegroups.com

Hi Sven,

thanks for reminding me about Trey's posts. I read them when they first came out and actually corresponded a bit with him when we were transitioning to ZFS as the storage underlay. I'll go back and read them again with a view to the metadata.

hjm


--

Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine

[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487

415 South Circle View Dr, Irvine, CA, 92697 [shipping]

XSEDE 'Campus Champion' - ask me about your research computing needs.

Map to MSTB| Map to Data Center Gate

 

harry mangalam

Mar 13, 2017, 12:53:54 PM
to fhgfs...@googlegroups.com, Jens Dreger

Hi Jens,

thanks very much for this.

To reiterate: I think the detailed stats below were made using RAM block devices (not SSDs), and the ext4 FS was created via mdadm? Or were they made with some other mechanism or layer (LVM2?)

How many block devices were used, and in what RAID config? I assume RAID1, but that might not be right.

The ZFS system presumably used the same RAM block devices, but controlled by ZFS.

Neither would involve any discrete disk controller, since they're RAM block devices.

Your idea about using snapshots for backups is a really good one. That's something ext4 won't do.

(Repeating myself, I guess): has anyone used NVMe SSDs as block devices for ZFS? It may be that they're so fast that increasing speed isn't an issue, since they can saturate any external I/O port being used.

We'll be taking delivery of some of this hardware, so I may be able to report on it soon.

hjm

Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine

Jens Dreger

Mar 13, 2017, 1:27:14 PM
to harry mangalam, fhgfs...@googlegroups.com
Hi Harry!

On Mon, Mar 13, 2017 at 09:53:50AM -0700, harry mangalam wrote:
> To reiterate: I think the detailed stats below were made based on RAM block
> devices (not SSDs) and the ext4 FS was created via mdadm? Or were they made
> with some other mechanism or layer (LVM2?)
>
> How many block devs were used and in what RAID config? I assume RAID1 but that
> might not be right.

I know it sounds weird, but there is no RAID in this setup. I just
wanted the block devices to be as fast as possible, to see how much
performance, at most, I could expect from NVMe cards. Being in RAM, they
will most likely fail at the exact same moment anyway (e.g. at
shutdown :)

So basically all I do for ext4 is:

# one RAM block device of 32 GiB (rd_size is in KiB)
modprobe brd rd_nr=1 rd_size=33554432
# plenty of inodes (one per 2 KiB) and 512-byte inodes so xattrs fit inline
mkfs.ext4 -i 2048 -I 512 -J size=400 -Odir_index,filetype /dev/ram0
mount /dev/ram0 /ram0

and for ZFS:

modprobe brd rd_nr=1 rd_size=33554432
zpool create -f ram0 /dev/ram0
# store xattrs as system attributes in the dnode
zfs set xattr=sa ram0

If buddy mirroring and recovery from meta server failure work well,
I might not create any RAID at all, since the mirror group should
handle this. One reason I came up with this idea is that I have some
cloudedge servers I could use as meta servers that can only take one
PCIe card, so I won't be able to install two NVMe cards in one server
to create a RAID1.
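
For reference, relying on the buddy group instead of local RAID boils down to something like the following; this is from the BeeGFS 6 docs as I recall them, so the IDs are placeholders and beegfs-ctl --help should be checked for the exact syntax of your version:

# Pair two metadata nodes into a buddy mirror group (hypothetical node/group IDs)...
beegfs-ctl --addmirrorgroup --nodetype=meta --primary=1 --secondary=2 --groupid=1
# ...and then enable metadata mirroring.
beegfs-ctl --mirrormd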

> The ZFS system presumably used the same RAM block devs, but controlled by ZFS.
> Both would not involve any discrete disk controller since they're RAM block
> devs.

Exactly.

Jens.

John Hanks

Mar 14, 2017, 8:58:23 AM
to fhgfs...@googlegroups.com
On Mon, Mar 13, 2017 at 7:53 PM harry mangalam <hjman...@gmail.com> wrote:

> (Repeating myself I guess), Has anyone used NVMe SSDs as block devs for ZFS? It may be that they're so fast that increasing speed isn't an issue since they can saturate any external IO port being used.



We just got a few of the Intel P3700 2 TB AIC form-factor drives. Here's about as crude a view as you can get:

[root@kw14427 ~]# zpool create datapoolnvme /dev/nvme0n1 /dev/nvme1n1
[root@kw14427 ~]# zfs create -o acltype=posixacl -o xattr=sa -o atime=off datapoolnvme/localnvme
[root@kw14427 ~]# zfs create -o acltype=posixacl -o xattr=sa -o atime=off -o mountpoint=/localnvme/scratch datapoolnvme/localnvme/scratch
[root@kw14427 ~]# cd /localnvme/scratch/
[root@kw14427 scratch]# time sh -c "dd if=/dev/zero of=./testfile count=$((1024*512)) bs=1M; sync"
524288+0 records in
524288+0 records out
549755813888 bytes (550 GB) copied, 235.248 s, 2.3 GB/s

real 3m59.295s
user 0m0.390s
sys 2m59.528s

Some sampling from dstat during the writes:

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
  1  11  86   2   0   0|5632B 1960M| 602B  362B|   0     0 |  57k  204k
  1  10  87   2   0   0|   0  2324M|1456B  858B|   0     0 |  59k  185k
  0   3  94   2   0   0|   0  2145M|1329B  362B|   0     0 |  55k   69k
  0   3  94   2   0   0|   0  2330M| 520B  362B|   0     0 |  59k   73k
  1  18  79   1   0   0|5120B 2220M|  70B  362B|   0     0 |  80k  322k
  0   3  94   3   0   0|   0  2168M| 198B  378B|   0     0 |  56k   69k
  0   3  94   2   0   0|   0  2266M| 326B  362B|   0     0 |  56k   71k
  1  12  86   2   0   0|5632B 2069M|  70B  362B|   0     0 |  58k  220k
  1   9  88   2   0   0|   0  2297M| 782B  378B|   0     0 |  59k  171k
  0   3  94   2   0   0|   0  2250M| 426B  362B|   0     0 |  55k   71k
  0   3  94   2   0   0|   0  2171M| 134B  362B|   0     0 |  55k   69k


A second simultaneous dd process just splits the available bandwidth. 

This is CentOS 7.2, Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz and 128 GB of RAM in the workstation.

[root@kw14427 ~]# nvme list
Node             SN                   Model                                    Version  Namespace Usage                      Format           FW Rev  
---------------- -------------------- ---------------------------------------- -------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     CVFT6162005P2P0EGN   INTEL SSDPEDMD020T4                      1.0      1           2.00  TB /   2.00  TB    512   B +  0 B   8DV10171
/dev/nvme1n1     CVFT609600112P0EGN   INTEL SSDPEDMD020T4                      1.0      1           2.00  TB /   2.00  TB    512   B +  0 B   8DV10171

jbh

harry mangalam

Mar 14, 2017, 9:59:17 AM
to fhgfs...@googlegroups.com, John Hanks

That's pretty impressive for a couple of devices. Not quite QDR-saturating, but for metadata it's at least 10x what's needed.

Oh, that I could make real storage out of NVMes like the Scalable Informatics device that Sven pointed to. And it's a very good doc, even without Joe's trenchant/truculent commentary. Really worth a look if you want to squeeze more I/O out of what you've got.

Thanks, John!

hjm

Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
