Performance

Paul

Dec 9, 2011, 11:13:11 AM
to XtreemFS
Hi All,

I work for a scale-out block storage vendor and I'm very interested in
testing this file system. We have been testing GPFS, Gluster, and ZFS
(for performance reasons). What I'd like to test is read and write
performance and I have a very sizable lab to test this. We have 10GbE
connectivity throughout and plenty of servers to test with. Are there
any sort of best practices I should follow to get optimal
performance? What would be considered optimal performance? I'd be
happy to blog/document my results once I get further. Any help is
much appreciated.

Thanks,

Paul

Michael Berlin

Dec 9, 2011, 5:30:09 PM
to xtre...@googlegroups.com
Hi Paul,

> What I'd like to test is read and write performance
> and I have a very sizable lab to test this. We have 10GbE connectivity
> throughout and plenty of servers to test with.

That sounds great :-) Please keep the following in mind: we don't consider
XtreemFS a parallel file system whose focus is on leveraging the maximum
theoretical throughput of a homogeneous cluster. Instead, we see its main
application as a distributed file system over WAN links - or, as it is
called nowadays, a cloud file system :-)
Still, it also works well in any LAN environment. Just keep in mind that we
did not optimize it to squeeze out the last MB/s of your hardware.

I just ran a quick test with the 1.3.1 Fuse client (mount.xtreemfs) with
asynchronous writes enabled, from my laptop against an OSD in our LAN, and
managed to max out my 1GbE port at ~100 MB/s [5]. Some time ago we also
benchmarked our old Java client writing to multiple OSDs via InfiniBand [1].
Limited by the hard disk of each single OSD, we reached more than 1100 MB/s
writing to 25 OSDs in parallel.

> Are there any sort of best
> practices I should follow to get optimal performance?

a) Avoid file creation
Since you want to benchmark read and write performance, avoid creating many
new small files. Otherwise you would also be benchmarking the metadata
server (MRC), which is involved in every file creation. Preferably, read or
write a single big file.
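
For the read side, a minimal sketch could look like the following (paths are
placeholders; the drop_caches step clears the client's local page cache so
you measure the network and the OSD instead of cached data):

  # run as root on the client; assumes a large file was written beforehand
  echo 3 > /proc/sys/vm/drop_caches
  dd if=/mnt/xtreemfs/benchmark.bin of=/dev/null bs=128k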

b) block size / object size
By default, writes are chopped into 128 kB objects. Make sure you use the
object size, or a multiple of it, as the buffer size when benchmarking.

Every object on the OSD is stored as a file in the "object_dir" (specified
in the OSD config file [2]). If the underlying file system / storage device
performs better with bigger files, create a volume with a larger stripe
size, e.g. mkfs.xtreemfs -s 1024 for 1 MB objects.

128 kB is currently also the maximum write size that recent Fuse versions
(2.8) and the Linux kernel support. So I *guess* you probably won't see a
big gain from bigger objects - but we haven't tested that yet!
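
For illustration, creating a volume with 1 MB objects and benchmarking it
with a matching block size could look roughly like this (hostnames, volume
name and mount point are placeholders, not recommendations):

  mkfs.xtreemfs -s 1024 my-mrc-host/bench_volume
  mount.xtreemfs my-dir-host/bench_volume /mnt/xtreemfs
  # buffer size matches the 1 MB object size, ~10 GB written in total
  dd if=/dev/zero of=/mnt/xtreemfs/benchmark.bin bs=1M count=10240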

c) async writes
If you use the Fuse client mount.xtreemfs, you also have to enable
asynchronous writes to increase the throughput. By default, the client
performs write()s synchronously, i.e. it returns from a write() only after
the OSD has acknowledged it. Applications usually write a file sequentially,
which means the next write() won't be issued before the current one has been
acknowledged by the OSD. As a consequence, there is a pause between the
*sending* of each write() request to the OSD, and this results in lower
throughput.

Run mount.xtreemfs with --max-writeahead=1310720 to enable async writes. If
you write 128 kB buffers, a value of 1310720 bytes means up to 10 objects may
be in flight and confirmed to the application as successfully written before
they are actually acknowledged by the OSD. Experiment with the parameters
"--max-writeahead" and "--max-writeahead-requests" and check whether higher
values result in better throughput. However, keep in mind that the current
implementation does not retry failed async writes. So, if a single async
write fails, your application has to cope with the resulting EIO error. This
will be fixed in a later version of the XtreemFS client.
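
A possible invocation (the DIR hostname, volume name, mount point and the
value of --max-writeahead-requests are placeholders chosen for illustration):

  mount.xtreemfs --max-writeahead=1310720 --max-writeahead-requests=10 \
      my-dir-host/bench_volume /mnt/xtreemfs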

d) read ahead

We don't directly support read ahead, the counterpart of async writes.
However, the Fuse library itself does. By default, mount.xtreemfs tells Fuse
to read ahead 10 * 128 kB of data [6]. You should be able to influence this
with the parameter -o max_readahead=<value in bytes>. We have not analyzed
the effect of different values for this parameter ourselves yet.
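
For example, doubling the default read ahead to 20 * 128 kB would look like
this (hostnames and mount point are placeholders again; the value itself is
just an example, not a tested recommendation):

  mount.xtreemfs -o max_readahead=2621440 my-dir-host/bench_volume /mnt/xtreemfs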

e) TCP window size / socket size

Even with an increased --max-writeahead, you may run into a situation where
the underlying TCP channel becomes the limit, because the maximum number of
bytes that may be sent but not yet acknowledged by TCP has been reached. That
limit is given by the TCP window size, which corresponds to the socket buffer
size of your operating system.

Use the bandwidth-delay product (BDP) to calculate the theoretical maximum
TCP throughput for a given socket size and network latency. For instance,
let's assume an RTT of 0.2 ms (meaning the sender gets an acknowledgement for
a sent TCP packet from the receiver 0.2 ms after sending it) and a socket
size of 100 kB.

BDP: socket size = latency * bandwidth

100 kB = 0.0002 s * bandwidth
=> bandwidth = ~488 MB/s

If the socket size turns out to be the limit, you have to increase the
maximum size on your systems. On the servers, you can change the server
config [4] to increase the socket size that is used. The Fuse client and the
C++ libxtreemfs currently do not have an option to set the socket size, which
means you have to increase the default socket size (together with the maximum
size) on the client test system.
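
On a Linux client, raising the default and maximum socket buffer sizes could
look roughly like this (run as root; the values are illustrative and not
tuned recommendations):

  # maximum and default socket buffer sizes in bytes
  sysctl -w net.core.rmem_max=4194304
  sysctl -w net.core.wmem_max=4194304
  sysctl -w net.core.rmem_default=1048576
  sysctl -w net.core.wmem_default=1048576
  # min/default/max limits for TCP buffer auto-tuning
  sysctl -w net.ipv4.tcp_rmem="4096 1048576 4194304"
  sysctl -w net.ipv4.tcp_wmem="4096 1048576 4194304"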

> What would be considered optimal performance?

We don't know :-) The numbers at the beginning should give you a guideline.

In general, I expect you to hit the limits of the following components in
this order:
1. Hard disk on the OSD
Today, a single hard disk has a maximum sequential read or write rate of
>100 MB/s. To get beyond that, you have to aggregate multiple disks into a
RAID or put the "object_dir" [2] of the OSD on a tmpfs/ramfs :-)
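
For a quick test, putting the OSD's object directory on a tmpfs could look
like this (the path is a placeholder for whatever object_dir your OSD config
specifies; do this with the OSD stopped, and remember that the data is gone
after a reboot):

  mount -t tmpfs -o size=16g tmpfs /var/lib/xtreemfs/objs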

2. CPU usage of the Fuse client
The communication between the Linux kernel and any user-space file system is
very CPU intensive, as all written or read data has to be copied between
kernel and user space. While running "dd" on my laptop and writing at
~100 MB/s, the CPU usage of mount.xtreemfs spiked up to 70% (of one core) on
my Core i7 L640 (2.13 GHz) CPU. This limit is not specific to XtreemFS: all
Fuse file systems suffer from it.

The alternative is to use our client libraries directly, which should be
less CPU intensive. Currently, you can use our C++ libxtreemfs or the Java
version. If you want to try the C++ one, just modify
"example_libxtreemfs.cpp" [3] to suit your needs. To compile it, check out
the SVN trunk and run "make client". After compiling, you can find the
binary "example_libxtreemfs" in the subdirectory "cpp/build".

3. OSD Server process
Depending on your hardware, the servers will also have a maximum throughput.
During my little test, the OSD Java server process showed a CPU usage of up
to 40% (of one core) on an older Xeon 5130 (2.00 GHz) CPU.

Once you have reached the maximum of a single server, you'll have to stripe
your files across multiple OSDs/RAIDs, e.g. create a volume where files are
striped across two OSDs with mkfs.xtreemfs -w 2. If you stripe files across
several OSDs, you also need to enable async writes, since the stripes are
currently written out sequentially by the Fuse/C++ client.
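
Putting that together, a striped-volume run could look like this (hostnames,
volume name and mount point are placeholders):

  mkfs.xtreemfs xtreemfs-mrc/striped_volume -w 2
  mount.xtreemfs --max-writeahead=1310720 xtreemfs-dir/striped_volume /mnt/striped_volume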

> I'd be happy to blog/document my results
> once I get further. Any help is much appreciated.

Sure. And I'm very interested in your results :-) Together we can look into
them and see whether there is any room left for improvement.

Best regards,
Michael

[1] http://xtreemfs.org/slides/lasco2008.pdf, page 12, write "throughput in
MB/s" graph, top right
[2] http://code.google.com/p/xtreemfs/source/browse/trunk/etc/xos/xtreemfs/osdconfig.properties#42
[3] http://code.google.com/p/xtreemfs/source/browse/trunk/cpp/src/example_libxtreemfs/example_libxtreemfs.cpp
[4] http://code.google.com/p/xtreemfs/source/browse/trunk/etc/xos/xtreemfs/osdconfig.properties#68
[5] I used the following parameter for mount.xtreemfs: --max-writeahead=1310720.
To write the file, I used "dd if=/dev/zero of=/mnt/xtreemfs/benchmark.bin
bs=128k" and looked at the throughput output dd prints in response to
"while true; do killall -USR1 dd; sleep 1; done".
[6] http://code.google.com/p/xtreemfs/source/browse/trunk/cpp/src/fuse/fuse_operations.cpp#338

Vivioli

Dec 22, 2011, 11:08:49 PM
to XtreemFS
This is my first contact with XtreemFS, but I would also like to get some
feedback on your results, especially against GPFS. I do understand that
you're not aiming to be the best-performing LAN solution, but your numbers
with GbE and IB look very good. I haven't read every page of the user guide,
but concerning replication and striping across multiple OSDs: is there a way
to enable striping of data across all OSDs by default while keeping a copy
of all data on at least 2 or more OSDs?

Can this striping and replication be done on a private network, separate
from the one presenting the storage to my hosts?

Can't wait to see your numbers on 10GbE.

Sabrina

Michael Berlin

Jan 4, 2012, 5:27:59 AM
to xtre...@googlegroups.com
Hi Sabrina,

> I haven't read every page of the user guide, but concerning replication
> and striping across multiple OSDs: is there a way to enable striping of
> data across all OSDs by default while keeping a copy of all data on at
> least 2 or more OSDs?

Replicating striped files works only with the read-only replication, not
with the read-write replication.

If you enable read-only replication on a volume or path, closed files are
automatically marked as read-only and are replicated afterwards.

If you want two additional copies, you have to set the replication factor
to 3. If all replicas shall have the same striping pattern (e.g. always
striped across 4 OSDs), create the volume with the matching stripe width
(mkfs.xtreemfs xtreemfs-mrc/striped_volume -w 4) and set up the default
replication policy on the mounted volume:
xtfsutil --set-drp --replication-policy readonly --replication-factor 3
--full <mount path>
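
Put together, the whole sequence could look like this (the MRC and DIR
hostnames and the mount point are placeholders):

  mkfs.xtreemfs xtreemfs-mrc/striped_volume -w 4
  mount.xtreemfs xtreemfs-dir/striped_volume /mnt/striped_volume
  xtfsutil --set-drp --replication-policy readonly --replication-factor 3 \
      --full /mnt/striped_volume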

> Can this striping and replication be done on a private network, separate
> from the one presenting the storage to my hosts?

Do you mean a private network that is separated from the usual storage
infrastructure? That won't work: within an XtreemFS installation, all nodes
have to be able to reach each other. Otherwise the data cannot be replicated.

> Can't wait to see your numbers on 10GbE.

I won't be doing any benchmarks on that in the near future. But I invite
everybody to contribute their findings :-)

Best regards,
Michael
