FhGFS Metadata on ZFS


Trey Dockendorf

Oct 7, 2013, 6:34:47 PM
to fhgfs...@googlegroups.com
My cluster's FhGFS filesystem was recently migrated from metadata on ext4 and storage on XFS to both metadata and storage on ZFS (using zfsonlinux).  So far things have been stable, but we've seen performance hits on the metadata side.  The migration also included an upgrade from 2011.04 to 2012.10.

Based on the issues I've had, I am not convinced that extended attribute metadata storage is faster on ZFS.  My benchmarks before the migration were cluster read/write tests rather than metadata-intensive ones, but we have users who run a large "du" over a directory structure containing ~20 million files.  Before the migration it completed in under 30 minutes; now it takes a little over 5 hours.  To have some baseline for judging whether metadata changes improve performance, I've used scripts to generate one million 0-byte files on our FhGFS system and then run a "find", a perl-based inode count, a "du", and finally an "rm".

Our metadata server:

64GB RAM, 16 cores (2 sockets).  The ZFS filesystem is set to recordsize=4k with atime=off.  It is currently RAIDZ2, but I'm migrating that to striped mirrors (basically RAID 10) shortly.  The metadata is stored using extended attributes, which I think is the source of our problem.  Given ZFS's caching abilities, I'm curious whether anyone has insight into whether disabling extended attribute storage would actually give better results.  Since "storeUseExtendedAttribs" can't be changed after the system is populated, is there any unofficial way to convert the extended attributes to flat files?
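For reference, the dataset properties described above amount to something like the following (pool/dataset names here are hypothetical placeholders, not our actual layout):

# hypothetical names; recordsize/atime set as described above
zfs create -o recordsize=4k -o atime=off tank/fhgfs-meta
zfs get recordsize,atime,xattr tank/fhgfs-meta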

Metadata and storage nodes are all SL 6.4.  The client nodes are all CentOS 5.7.  Everything is on fhgfs-2012.10.r8

If there's any information I can provide to make this easier to figure out, please let me know.

The benchmarks I currently run to test metadata are bash commands executed from Ruby.  Items that loop x1000 are run across 10 threads in parallel.  Tests run from a DDR IB-enabled node, with the metadata and storage servers also on DDR IB.  The basic idea is to create one million empty files across one thousand directories to stress the metadata server.

1) Ruby loop x1000 to execute "mkdir /fhgfs/scratch/benchmark/<I>"
2) Ruby loop x1000 to execute "for i in $(seq 1 1000); do touch /fhgfs/scratch/benchmark/${i}/file_<I>; done"
3) find /fhgfs/scratch/benchmark
4) perl based count loop to count inodes
5) du -s /fhgfs/scratch/benchmark
6) rm -rf /fhgfs/scratch/benchmark/*
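For anyone who wants to reproduce something similar, here is a rough shell-only sketch of the steps above (the real harness is Ruby driving these commands across 10 threads; the perl inode counter in step 4 is approximated with find | wc -l):

BASE=/fhgfs/scratch/benchmark
# 1) create 1000 directories, 10 jobs in parallel
seq 1 1000 | xargs -P 10 -I{} mkdir -p "$BASE/{}"
# 2) one million empty files: file_<I> in every directory
seq 1 1000 | xargs -P 10 -I{} bash -c "for i in \$(seq 1 1000); do touch $BASE/\${i}/file_{}; done"
# 3-6) metadata-heavy traversal and cleanup
time find "$BASE" > /dev/null
time find "$BASE" | wc -l
time du -s "$BASE"
time rm -rf "$BASE"/*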

Latest benchmarks (values in seconds):

1) mkdir loop: 3
2) touch loop: 1898
3) find: 784
4) perl inode count: 2621
5) du -s: 1667
6) rm -rf: 3574

While this ran I watched the fhgfs-admon GUI and noticed that "Queued work requests" never went above 2.  I can't seem to get the admon graphs for "Work Requests" to scale correctly for the past 2 days, but I recall the number being well over 10,000.

Thanks!
- Trey

Bernd Schubert

Oct 10, 2013, 5:12:14 AM
to fhgfs...@googlegroups.com
Hello Trey,

you could use our patched bonnie to benchmark meta-data performance with
and without extended attributes.

https://bitbucket.org/aakef/bonnie

without EAs:
> /path/to/bonnie/bonnie++ -s0 -n10:200:200:1 -u65535 -g 65535 -d /path/to/meta_storage

with EAs:
> /path/to/bonnie/bonnie++ -s0 -n10:200:200:1 -u65535 -g 65535 -X -d /path/to/meta_storage


Both commands create 10000 files in one directory and write 200 bytes to
each.  Without '-X', real file contents are written; with '-X', the data is
written as EAs.
Our bonnie repository also includes a parallel_bonnie script, which
allows metadata performance tests with several bonnie instances
(usually one runs this over a network file system such as FhGFS, but you
can also use it for local benchmarks only).

Btw, the idea of extended attributes is to store the data in the
inode itself and avoid a dedicated (almost empty) 4K block.  I have no
idea whether this works with ZFS or whether ZFS can inline data anyway.

Hope it helps,
Bernd

Di Pe

Oct 11, 2013, 12:45:14 PM
to fhgfs...@googlegroups.com
Trey, 

This looks like a very interesting project. I'm curious what you were hoping to achieve by putting the metadata server on ZFS. Were you concerned about bitrot on the metadata server? 

Unrelated but out of curiosity: would you be willing to share your hardware configuration details? (drives, controller, cpu, IB cards, switch)

I'm particularly interested in configurations that support IB as well as 10G (e.g. VPI).

Also, did you do any benchmarks with ZFS compression turned on vs. off? (All benchmarks you can share would be highly appreciated.)

dipe




Trey Dockendorf

Oct 11, 2013, 6:02:56 PM
to fhgfs...@googlegroups.com
If anyone else does metadata on ZFS, you MUST set xattr=sa on the ZFS filesystem storing the metadata.  The default is "on", which stores each xattr in a hidden directory and has terrible performance.  Setting xattr=sa makes extended attribute set/get performance nearly the same as ext4.

For example:

touch 1 million files : before - 1898s , after - 500s
find on 1 million files : before - 784s , after - 9s
perl count_inodes : before - 2621s , after - 182s
du -s : before - 1667s , after - 87s
rm -rf : before - 3574s , after - 951s
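For reference, the change itself is a one-liner (the dataset name below is a hypothetical placeholder); as far as I understand, xattr=sa only applies to newly written xattrs, so existing metadata does not benefit until it is rewritten:

zfs set xattr=sa tank/fhgfs-meta
zfs get xattr tank/fhgfs-meta    # should now report "sa"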

Bernd,

Thanks for the link to the bonnie++ fork.  I'll try it out as soon as I get some cycles.

This ZoL issue documents the use of System Attributes for storing xattrs in inodes: https://github.com/zfsonlinux/zfs/issues/443

dipe,

The move to ZFS was motivated by a few factors.

1) Backups and recovery - zfs send/recv and snapshots (see the sketch after this list)
2) Issues with mdraid and hardware RAID cards.
3) We wanted to upgrade from 2011.04 to 2012.10 without going offline for a long period of time.  I think what we did warrants a write-up, if for no other purpose than the community's amusement, and possibly as a guide to others who need to perform similar upgrades.
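To illustrate item 1, a minimal sketch of the snapshot plus send/recv workflow (pool, dataset, and backup host names are hypothetical placeholders):

zfs snapshot tank/fhgfs-meta@2013-10-11
# first, a full send of the initial snapshot to the backup host
zfs send tank/fhgfs-meta@2013-10-11 | ssh backuphost zfs receive backup/fhgfs-meta
# afterwards, incremental sends between consecutive snapshots
zfs send -i tank/fhgfs-meta@2013-10-10 tank/fhgfs-meta@2013-10-11 | ssh backuphost zfs receive backup/fhgfs-meta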

Hardware:

IB Switch - Voltaire Grid Director 2012 (DDR) 
Ethernet Switch - HP E5412zl (using a 4-port 10Gbps CX4 module and a 4-port 10Gbps SFP+ module for FhGFS metadata and storage; the rest are 24-port 1Gbps RJ45 modules)

Metadata:
Chassis - Supermicro 1U w/ 10x 2.5" hotswap
Motherboard - Supermicro X9DRW
CPU - 2x Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz
RAM - 64GB
Drives - 10x 240GB MLC SSDs (lshw identifies as IRSC21AD240M4)
Controller - LSI SAS2308
IB card - Mellanox MT26428 (IB QDR via QSFP+ and 10GigE via SFP+)

We decided to leave compression off for storage and metadata.  I can send benchmarks once I have them formatted.  The ones I've been doing have been iozone tests from 10 compute nodes in parallel.  I'll combine those with results from bonnie++.

Once the migration is all said and done we'll be running 1 metadata server and 5 storage nodes.  We are also thinking of putting 3x SSDs (mirror + hot spare) in each storage node to distribute the metadata, but for now we're focused on getting the filesystem back past 100TB for our users' sake.

- Trey