Metadata on NVMe drives?

Toby Darling

Mar 15, 2016, 9:15:49 AM
to fhgfs...@googlegroups.com
Hi

beegfs 2015.03.r10-el6.
Scientific Linux 6.7, 2.6.32-573.3.1.el6.x86_64

Does anyone have any experience of using PCIe NVMe drives for metadata
storage?

We have a cluster of 6 servers, all doing storage and 3 also doing
metadata with the metadata stored on NVMe drives. We are intermittently
experiencing some very high numbers (200-500) for Queued Work Requests
and a corresponding drop in performance. There doesn't appear to be a
correlation between the total number of Work Requests and a rise in
Queued Work Requests.

Each meta server has 2 NVMe drives (RAID 1):
* 2 are using pairs of Samsung SM951, one formatted ZFS and one ext4
  (originally all were ZFS, but we tried ext4 in the course of debugging)
* the other has a pair of Intel 750 using ZFS
* kernel tuning is as per the wiki
* the metadata is stored as EAs
At some point, they all suffer in the same way. Sometimes the problem
only lasts a couple of minutes, sometimes more than 30 minutes.

If anyone's successfully storing metadata on NVMe drives, it'd be great
to hear from you, or indeed, anyone with any ideas. Thanks!

Details of metadata file systems at http://pastebin.com/0jrdanRA

Cheers
Toby
--
Toby Darling, Scientific Computing (2N249)
MRC Laboratory of Molecular Biology
Francis Crick Avenue
Cambridge Biomedical Campus
Cambridge CB2 0QH
Phone 01223 267070

Sven Breuner

Mar 27, 2016, 6:51:48 PM
to fhgfs...@googlegroups.com, Toby Darling
hi toby,

when you use "iostat -mx 1" on the servers in the situations that you describe
below, do you see a high %util value for the NVMe devices, so that we could
conclude that something is stalled/delayed at the device level?
if you haven't done this yet, you could also try increasing tuneNumWorkers
in beegfs-meta.conf to 120 (the default is rather low) to allow for more
parallelism. increasing connMaxInternodeNum in beegfs-meta.conf to 64 will
also allow more parallelism when the meta server needs to talk to the storage
servers (e.g. on close() or on stat() of open files). but if things are
stalled/delayed at the NVMe device level, this will of course only help for
requests that can be served from the RAM cache.
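
just to be explicit, the two tunables above would look like this in
beegfs-meta.conf (a sketch of only the relevant lines; all other settings
stay as they are, and the beegfs-meta service needs a restart to pick up
the changes):

```ini
# beegfs-meta.conf -- only the two tunables discussed here
tuneNumWorkers       = 120
connMaxInternodeNum  = 64
```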

in case you were not aware of it: you can also use "beegfs-ctl --userstats
--interval=3 --nodetype=meta" (or "beegfs-ctl --clientstats ...") to try to
identify the type of requests that are dominating the workload in such cases.
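
and regarding the "iostat -mx 1" check: a small filter can make device
saturation easier to spot when the queued work requests spike. this is just
a sketch -- it assumes sysstat's extended output, where %util is the last
column, and the 90% threshold is an arbitrary choice:

```shell
# print only NVMe devices whose %util exceeds 90; field positions assume
# %util is the final column of "iostat -mx" extended output
flag_busy_nvme() {
  awk '/^nvme/ { if ($NF + 0 > 90) print $1, $NF "%" }'
}

# live use on a meta server (runs until interrupted):
# iostat -mx 1 | flag_busy_nvme
```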

best regards,
sven