If anyone else runs metadata on ZFS, you MUST set xattr=sa on the ZFS filesystem storing the metadata. The default is "on", which stores each xattr in a hidden directory and has terrible performance. Setting xattr=sa makes extended attribute setting/getting nearly as fast as on ext4.
For example:
touch 1 million files: before 1898s, after 500s
find on 1 million files: before 784s, after 9s
perl count_inodes: before 2621s, after 182s
du -s: before 1667s, after 87s
rm -rf: before 3574s, after 951s
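For anyone who wants to flip the property, it's a one-liner; the dataset name below is just a placeholder for whatever dataset holds your metadata:

  zfs set xattr=sa meta/fhgfs
  zfs get xattr meta/fhgfs    # should now report "sa"

Keep in mind the change only applies to newly written xattrs; files created before the switch keep their directory-based xattrs until those are rewritten.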
Bernd,
Thanks for the link to the bonnie++ fork. I'll try it out as soon as I get some cycles.
dipe,
The move to ZFS was motivated by a few factors.
1) Backups and recovery - zfs send/recv and snapshots (see the sketch after this list).
2) Issues with mdraid and hardware RAID cards.
3) We wanted to upgrade FhGFS from 2011.04 to 2012.10 without going offline for a long period of time. I think what we did warrants a write-up, if for no other purpose than the community's amusement, and possibly as a guide to others who need to perform a similar upgrade.
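On point 1, what we have in mind is something along the lines of the usual snapshot-then-incremental-send pattern. A rough sketch (dataset names, snapshot names, and the backup host are all placeholders, not our actual setup):

  zfs snapshot meta/fhgfs@2013-01-15
  zfs send -i meta/fhgfs@2013-01-14 meta/fhgfs@2013-01-15 | ssh backup01 zfs recv -F tank/fhgfs-meta

The first snapshot obviously has to go over as a full send; after that the incrementals only ship the changed blocks.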
Hardware:
IB Switch - Voltaire Grid Director 2012 (DDR)
Ethernet Switch - HP E5412zl (using a 4-port 10Gbps CX4 module and a 4-port 10Gbps SFP+ module for FhGFS metadata and storage; the rest are 24-port 1Gbps RJ45 modules)
Metadata:
Chassis - Supermicro 1U w/ 10x 2.5" hotswap
Motherboard - Supermicro X9DRW
CPU - 2x Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz
RAM - 64GB
Drives - 10x 240GB MLC SSDs (lshw identifies as IRSC21AD240M4)
Controller - LSI SAS2308
IB card - Mellanox MT26428 (IB QDR via QSFP+ and 10GigE via SFP+)
We decided to leave compression off for storage and metadata. I can send benchmarks once I have them formatted. So far I've been running iozone tests from 10 compute nodes in parallel; I'll combine those with results from bonnie++.
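For anyone who wants to evaluate compression on their own pool, toggling and checking it per dataset is simple (the dataset name here is a placeholder):

  zfs set compression=on tank/fhgfs-storage    # lzjb unless you pick another algorithm
  zfs get compressratio tank/fhgfs-storage     # shows how much you'd actually save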
Once the migration is said and done, we'll be running 1 metadata server and 5 storage nodes. We are also thinking of putting 3x SSDs (mirror + hot spare) in each storage node to distribute the metadata, but for now we're focused on getting the filesystem back past 100TB for our users' sake.
- Trey