Update on ZFS support?


Harry Mangalam

Jun 2, 2015, 5:21:48 PM
to fhgfs...@googlegroups.com
Hi Bee People,

In a small win for BeeGFS, another group at our institution has decided to move to BeeGFS for their cluster, and since they have a lot of experience with ZFS, they want to run it over that filesystem.
From various reports the combination seems to work surprisingly well: <http://open-zfs.org/w/images/6/62/TU_Wien_Vienna_-_FhGFS_over_ZFS_Performance.pdf>. But is there any update on allowing the quota system to run on BeeGFS / ZFS?

I don't want to encourage them to run that combination if quotas are not going to be operational for a long time.

hjm



Frank Kautz

Jun 3, 2015, 9:40:53 AM
to fhgfs...@googlegroups.com
Hi,

we need quotactl() support from the underlying file system for our
implementation, and ZFS doesn't support quotactl(). ZFS has its own
quota implementation and doesn't use the kernel quota machinery.
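
For illustration, ZFS manages its own quotas entirely through dataset
properties rather than through the kernel quota interface; for example
(the pool, dataset and user names here are only placeholders):

  # set and inspect a per-user quota with ZFS's built-in mechanism
  zfs set userquota@alice=10G tank/data
  zfs get userquota@alice tank/data
  # report per-user space consumption on the dataset
  zfs userspace tank/data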

Now, the good news is that the ZFS people appear to be working on
quotactl() support, though I don't know how long the implementation will take:
https://github.com/zfsonlinux/zfs/issues/2922#issuecomment-84444700

kind regards,
Frank

Trey Dockendorf

Jun 4, 2015, 3:47:40 PM
to fhgfs...@googlegroups.com
Are you looking for quota reporting or enforcement with ZFS on BeeGFS?

I have a series of scripts that produce a quota report for our BeeGFS storage servers running ZFS.  I'll try to get them to a point where they can at least be published on GitHub.  The basic idea is that a collection script logs into every BeeGFS storage server over SSH and runs "zfs userspace", and a Python script then parses the output into a report, roughly along the lines of the sketch below.  The system doing the collection needs root-level access via SSH keys; we run it from one of the administrative systems on the cluster.  This method is far from ideal, but it does let us view usage by user and by group across 7 storage nodes.
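
Roughly, the collection step looks like this (the hostnames and dataset name below are placeholders, not our real layout):

  # gather per-user usage from every storage node over SSH
  for host in storage01 storage02 storage03; do
      ssh root@"$host" zfs userspace -H -p -o type,name,used tank/fhgfs/storage \
          > "reports/${host}.userspace"
  done
  # a separate python script then merges the per-host files into one report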

We also have a locally patched version of the quota utility that supports ZFS.  One of my colleagues wrote the patch; I'll see if I can get him to publish it.  It allows user quotas to be queried via the "quota" command for our /home filesystem, which is served over NFS with ZFS as the backing storage.

- Trey

harry mangalam

Jun 4, 2015, 4:01:00 PM
to fhgfs...@googlegroups.com, Trey Dockendorf

Hi Trey,

Thanks very much for the info. We already have Robinhood scans running on our filesystems to do weekly accounting, so I'm not sure this would provide much more actionable info (if I understand what you're saying).

The new BeeGFS sysadmins would like to be able to use enforceable quotas, but even if they can't, the Robinhood results would be a good 'surveillance' tool.

I'll have to see what's acceptable to them.

hjm

---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
Thanks, but I will not be following you on Twitter.

Trey Dockendorf

Jun 9, 2015, 12:22:24 PM
to harry mangalam, fhgfs...@googlegroups.com
I've put my code on GitHub: https://github.com/treydock/fhgfs-ctl-zfs-getquota.  I don't know how adaptable it is, and it may be redundant if you already have scans running to collect usage information.

- Trey

nathan....@uci.edu

Oct 26, 2015, 6:23:34 PM
to beegfs-user, hjman...@gmail.com
Hi Trey,

  I'm the one evaluating beegfs on the other system at UC Irvine. We have a 60-disk (6TB each) JBOD for storage and a separate metadata node with 6 SSDs.

  There are MANY potential knobs to turn, especially on the ZFS end. Can you provide any insight into how you arranged your pools? I noticed annoyingly high space usage on the metadata pool when using ZFS directly (about 50GB of metadata per TB of actual data), but something more reasonable when using an ext4-formatted zvol on the same zpool (about 5GB/TB). I have xattr=sa, but I think ashift=12 may be forcing each 512B chunk of metadata to take up 4K on the SSDs. With the ext4 zvol (128k blocksize, lz4 compression), these may get packed together as "regular data". This is just speculation at the moment.
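
For reference, these are the kinds of commands I've been comparing with (the dataset and zvol names are just what I used while testing):

  # properties and size of the native metadata dataset
  zfs get xattr,recordsize,compression,used tank/meta
  zdb | grep ashift        # confirm which ashift the pools actually got
  # ext4-on-zvol alternative for comparison (128k volblocksize, lz4)
  zfs create -V 200G -o volblocksize=128k -o compression=lz4 tank/metavol
  mkfs.ext4 /dev/zvol/tank/metavol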

  I also tried setting some zfs module parameters for the storage pool(s) that are supposed to be appropriate for a Lustre OST on ZFS. Initial performance tanked, so I am backing them off incrementally. 
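
(If anyone wants to repeat this: most of the parameters can be inspected and reverted at runtime under /sys, though persistent settings still belong in modprobe.d. Example only; whether a given parameter takes effect on the fly varies.)

  # check and revert a ZFS module parameter without reloading the module
  cat /sys/module/zfs/parameters/zfs_prefetch_disable
  echo 0 > /sys/module/zfs/parameters/zfs_prefetch_disable
  # persistent version, e.g. in /etc/modprobe.d/zfs.conf:
  #   options zfs zfs_prefetch_disable=1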


Thanks,
Nate Crawford

--
Dr. Nathan Crawford              nathan....@uci.edu
Modeling Facility Director
Department of Chemistry
1102 Natural Sciences II         Office: 2101 Natural Sciences II
University of California, Irvine  Phone: 949-824-4508
Irvine, CA 92697-2025, USA

harry mangalam

Oct 26, 2015, 6:46:57 PM
to nathan....@uci.edu, beegfs-user

On Monday, October 26, 2015 03:23:33 PM nathan....@uci.edu wrote:

> Hi Trey,
>
> I'm the one evaluating beegfs on the other system at UC Irvine. We have a
> 60-disk (6TB each) JBOD for storage and a separate metadata node with 6
> SSDs.

Some general comments:

- Packing all your storage into one JBOD chassis is not helping. The way you get 'better than ZFS' performance out of a ZFS underlay is to use multiple storage servers, so you have multiple machines, multiple CPUs, and multiple IO channels effectively bonded.

In one chassis, you're effectively guaranteed to get worse than ZFS-only performance.

Re: the metadata storage, that sounds enormously high. For our largest BeeGFS (XFS underlay - /dfs1 - 368TB of 464TB used), the BeeGFS MD server runs the MD storage on an ext4 fs that only uses ~3GB of storage on a fast RAID10. That's for ~9.3M files.

So that's even MUCH lower than your lower number.

I'm having trouble following how you're setting up the MD system... are you saying that you're running the MD filesystem on ext4 over a zvol on ZFS? That sounds... suboptimal.

hjm

XSEDE 'Campus Champion' - ask me about your research computing needs.

Map to Office | Map to Data Center Gate

[the command line is the new black]

---

 

Trey Dockendorf

Oct 26, 2015, 9:06:04 PM
to fhgfs...@googlegroups.com, Harry Mangalam
For the pools, I am extremely paranoid about resilver times, so I never put more than 10 disks into a single vdev (RAIDZ2) and always have 1 hot spare per vdev.  Since you have 60 disks to work with, that makes things tricky.  Multiple vdevs in a single pool are striped, sort of like RAID60.  On my systems we have 24 disks; I put in 2x RAIDZ2 vdevs with 10 disks each and use 2 disks as spares.  The other 2 disks are left outside ZFS in case we ever need to replace them with SSDs to distribute metadata across storage nodes.  Below is the zpool layout of one of my storage nodes [1] and of my single metadata node [2].  As for the ZFS filesystems, I have found out the hard way that it's best never to use the pool's root filesystem for anything, so once you create your zpool, create child filesystems that actually contain your data.  This allows for easier ZFS send/receive for things like backups and/or replacing servers.  So I have tank, and then I create tank/fhgfs/storage on my storage nodes (or tank/beegfs/storage if you're on 2015.03).
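
In command form, the layout above comes down to roughly this (the dNN names are my vdev aliases from the config listing below; substitute your own device names):

  zpool create -o ashift=12 tank \
      raidz2 d01 d02 d03 d04 d05 d07 d08 d09 d10 d11 \
      raidz2 d13 d14 d15 d16 d17 d18 d19 d20 d21 d22 \
      spare d12 d23
  # ashift=12 per the recommendation further down
  # never put data in the pool root; give BeeGFS a child filesystem instead
  zfs create -p tank/beegfs/storage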

If you don't have SSDs on your storage servers for the intent log and read cache, I'd recommend adding some.  ZFS will use the data drives for the intent log if not given a dedicated log device, and that will slow writes down dramatically.  The read cache (L2ARC) on our systems is very underutilized, but the ARC ends up being used very heavily, so all new systems have at minimum 128GB of RAM (~30-40% goes to ARC).  We also set tuneRemoteFSync=false on our FhGFS/BeeGFS clients.
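
Adding those after the fact looks something like this (the partition names come from my aliases below; your device paths will differ):

  # mirrored intent log (SLOG) plus two L2ARC cache devices
  zpool add tank log mirror ssd0-part2 ssd1-part2
  zpool add tank cache ssd0-part3 ssd1-part3
  # and on the clients (beegfs-client.conf, or fhgfs-client.conf on older releases):
  #   tuneRemoteFSync = false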

Our storage systems have no special tuning done in ZFS besides compression=lz4 (used on metadata too).  The defaults have served us well thus far.  We do limit the ARC size to avoid heavy load driving the system into OOM conditions.
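
The ARC cap is just a module option; the value is in bytes and depends entirely on how much RAM the node has (48GB below is only an example for a 128GB machine):

  # /etc/modprobe.d/zfs.conf
  options zfs zfs_arc_max=51539607552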

For metadata we set xattr=sa, recordsize=4k and zfs_prefetch_disable=1 (a kernel module option).  Our metadata zpool is three mirrors (6 SSDs total), each SSD 240GB.  Our metadata pool is only at about 50% capacity with roughly 200TB used across 7 storage nodes.  Currently we have 216GB used for metadata (the rest is taken up by snapshots), so that works out to about 1.08GB of metadata per TB of data.  Your 50GB per TB is definitely not good.  I'd look at lowering your recordsize and ensuring you have xattr=sa set before any data is added.  Things like recordsize and xattr won't affect existing data, so the only way to ensure old data uses them is to zfs send/receive after the properties are set.  We are still on ZoL 0.6.3, so hopefully what you're seeing is not some kind of regression, assuming you're using something newer than 0.6.3.
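
Spelled out, that is roughly the following (the dataset name is a placeholder):

  zfs set xattr=sa tank/beegfs/meta
  zfs set recordsize=4k tank/beegfs/meta
  zfs set compression=lz4 tank/beegfs/meta
  # kernel module option, e.g. in /etc/modprobe.d/zfs.conf:
  #   options zfs zfs_prefetch_disable=1
  # as noted, properties only apply to newly written data; to rewrite old data:
  zfs snapshot tank/beegfs/meta@rewrite
  zfs send tank/beegfs/meta@rewrite | zfs receive tank/beegfs/meta-new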

I recommend using ashift=12.  The main reason is drive interoperability, in case the drives are 4K-sector or get replaced by 4K-sector drives later.  Some extra space is 'lost' if you set ashift=12 on 512-byte drives, but I've always found that an acceptable price for getting every bit of performance out of the drives should they end up being 4K drives.

I use a few OS-level tuning adjustments on our storage systems via tuned [3].  They are very similar to the tuned profiles posted a while ago by someone on this list.  The main difference is setting the drives to use the "noop" scheduler, since ZFS has its own internal IO scheduler.  Having ample RAM and CPUs helps with ZFS, as does having SSDs for the intent log and L2ARC.
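
The scheduler part of that is just the following for each data disk (sdX is a placeholder for whatever block devices back the pool; our tuned profile applies it at boot):

  # hand raw requests to ZFS, which schedules I/O itself
  echo noop > /sys/block/sdX/queue/scheduler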

- Trey

[1]:
config:

NAME            STATE     READ WRITE CKSUM
tank            ONLINE       0     0     0
 raidz2-0      ONLINE       0     0     0
   d01         ONLINE       0     0     0
   d02         ONLINE       0     0     0
   d03         ONLINE       0     0     0
   d04         ONLINE       0     0     0
   d05         ONLINE       0     0     0
   d07         ONLINE       0     0     0
   d08         ONLINE       0     0     0
   d09         ONLINE       0     0     0
   d10         ONLINE       0     0     0
   d11         ONLINE       0     0     0
 raidz2-1      ONLINE       0     0     0
   d13         ONLINE       0     0     0
   d14         ONLINE       0     0     0
   d15         ONLINE       0     0     0
   d16         ONLINE       0     0     0
   d17         ONLINE       0     0     0
   d18         ONLINE       0     0     0
   d19         ONLINE       0     0     0
   d20         ONLINE       0     0     0
   d21         ONLINE       0     0     0
   d22         ONLINE       0     0     0
logs
 mirror-2      ONLINE       0     0     0
   ssd0-part2  ONLINE       0     0     0
   ssd1-part2  ONLINE       0     0     0
cache
 ssd1-part3    ONLINE       0     0     0
 ssd0-part3    ONLINE       0     0     0
spares
 d12           AVAIL   
 d23           AVAIL   



[2]:
config:

NAME        STATE     READ WRITE CKSUM
tank        ONLINE       0     0     0
 mirror-0  ONLINE       0     0     0
   ssd03   ONLINE       0     0     0
   ssd04   ONLINE       0     0     0
 mirror-1  ONLINE       0     0     0
   ssd05   ONLINE       0     0     0
   ssd06   ONLINE       0     0     0
 mirror-2  ONLINE       0     0     0
   ssd07   ONLINE       0     0     0
   ssd08   ONLINE       0     0     0
spares
 ssd09     AVAIL   




nathan....@uci.edu

Oct 26, 2015, 10:32:17 PM
to beegfs-user, hjman...@gmail.com
  Thanks! We are basically on the same page, but since our use case will be non-backed-up storage/scratch, I was planning on six 10-disk raidz2 vdevs. 

  I'm trying to determine if it makes any sense to split it into multiple pools. Normally this would be silly; it would create a management headache, restrict IO distribution flexibility, require splitting L2ARC devices, etc. As a BeeGFS storage backend, however, it could:

1) Allow more simultaneous independent transaction groups, which could help with many simultaneous clients (but I'm skeptical)
2) Keep blocks of disks independent, which could aid future incremental disk upgrades
3) Better match storage target capacity and performance to the existing arrays in the cluster, which we plan on adding to the BeeGFS setup over time.

I haven't seen much benefit with #1, but I'm still testing various pool sizes. With #2, we could migrate data off and upgrade 10 disks at a time, but this might be as much of a pain as popping in each new larger disk and waiting to resilver. It is more likely that we would just buy a new JBOD at that point. #3 can be approximated by a single pool with multiple zfs datasets as beegfs targets. This seems to have the same overall performance as a single target per 60-disk pool. 
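
(For #3 that just means carving the one pool into several datasets and pointing a BeeGFS storage target at each, along these lines; the dataset names and mountpoints below are hypothetical:)

  zfs create -o mountpoint=/mnt/beegfs/target01 tank/target01
  zfs create -o mountpoint=/mnt/beegfs/target02 tank/target02
  zfs create -o mountpoint=/mnt/beegfs/target03 tank/target03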

  The metadata disk situation is much more annoying. We have the same basic layout (6 200GB SSDs in a pool as 3 mirrored vdevs), using xattr=sa and lz4 compression, but haven't disabled prefetch yet. Changing record size did not affect the usage blowup. We're running Scientific Linux (CERN) 6.7, but with 2.6.32-504.30.3.el6.x86_64 to be compatible with some hardware drivers (Intel Infiniband). The latest ZFS rpms in the repo that work with that kernel are 0.6.4, but it is probably worth at least trying the EL6.7 kernel and ZFSonLinux 0.6.5 on the metadata node. It is good to hear that someone else has NOT had the same issue.

Thanks,
Nate