Would you ever rewrite this in Google Go, C, C++ or some other language that doesn't eat all your memory while being slow as hell?
You are very kind to answer.
Features are incredible. I love them. But speed is very important. We
have a 10 Gbps network and we're planning to set up 800 nodes, and we
need top performance... We're planning on scaling up to 40 Gbps
soon...
Geo-replication sounds awesome too. Striping, we need it (for the cloud)...
All of this, except the object-based storage, is present in Ceph and
Gluster. I like all three projects... and I think you need to think
seriously about speed.
--
It's hard to be free... but I love to struggle. Love isn't asked for;
it's just given. Respect isn't asked for; it's earned!
Renich Bon Ciric
On Tue, Feb 14, 2012 at 2:58 AM, Felix Hupfeld <fhup...@googlemail.com> wrote:
> When we find a performance problem with the JVM...
>
> Seriously, Java's bottleneck is memory if you don't approach it
> consciously. But XtreemFS does very careful memory management, and we
> haven't seen Java or the JVM be a problem. You might run into issues if you do
> IO at bleeding-edge speeds, but that's not what XtreemFS is aiming for, and
> it is fine doing high-performance IO.
>
> If we hadn't used Java and user-level drivers, we would still be
> hacking low-level code with the manpower that was available. Instead we
> have a full file system with Paxos-based fault-tolerant replication at
> basically all levels. Compare this with projects that chose other languages.
Ah, so good to read this! We will test it... definitely... If I can
make it run on Fedora, that is ;) (I'm having problems with the init
scripts using sudo, and with them not being able to find Java.)
> Please go ahead and evaluate XtreemFS on your hardware. We are curious where
> the bottleneck is. Quite likely it will be network, with some CPU usage, but
> I guess a 40 Gbps connected machine will have enough cores.
Ok, well, I'll tell you how it goes. I'll do some dd runs and other benchmarks...
On Wed, Feb 15, 2012 at 3:42 AM, Felix Hupfeld <fhup...@googlemail.com> wrote:
> 10 Gbps... come on ;) With bleeding edge I meant huge cluster installations
> with the latest technology, which usually have to use Lustre and do not go over
> TCP/IP. You can contest whether what they do is actual file system usage,
> but still.
>
> We successfully saturated Infiniband in
> 2008: http://xtreemfs.org/publications/striping-lasco-camera.pdf
You're considerate and kind. Thanks for the overall tips and help. I
will try to document things as best as possible.
Phase 1 is a 10 Gbps connection. Phase two will be 40 Gbps across our
infrastructure; that will take a bit of time, but it's close.
> If you want to max out not only a single node but the whole cluster, be
> aware of your backplane bandwidth. It might be that it does not do 800 * 10
> Gbps.
Ah, yeah. Our test bed is 11 servers. Once implemented, we will
expand to 800 nodes, with rack interconnects of somewhere between 40 Gbps
and 80 Gbps; not really sure yet.
Anyway, thanks a lot for the offer of help. I'll get back to you when
I'm up and running.
On Thu, Feb 16, 2012 at 8:00 AM, Felix Hupfeld <fhup...@googlemail.com> wrote:
> We are eagerly awaiting your report. I guess I do not have to state
> this explicitly, but keep an eye on block sizes at all levels (object sizes,
> write sizes, ...; they need a certain size) and monitor CPU usage, because
> that's the cost of maxing out your NICs. The better you document everything,
> the more valuable it will be. If you need help with performance tuning, please
> come ask.
Ok, did some tests with 10 OSDs, 1 MRC and 1 DIR.
Created a RAID0 volume with -w 10 and -a VOLUME.
Mounted it on a separate server and wrote to it (usual dd if=/dev/zero
... bs=1M... )
My top performance reached 87 MB/s
On the other hand, if I write directly to the underlying mount point
(/var/lib/xtreemfs/objs), I can do:
# dd if=/dev/zero of=file bs=1M count=10k
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 7.71371 s, 1.4 GB/s
I also tried sending this file with nc from host to host, and I could
reach 1.2 GB/s.
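In case it's useful, this is roughly how I did the nc test (host name and
port are just placeholders, and depending on the netcat flavor the listener
may need "-l -p 12345" instead):

# on the receiving host
nc -l 12345 > /dev/null

# on the sending host
dd if=file bs=1M | nc otherhost 12345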
I'll read the link you sent... and I'm also considering the libxtreemfs thingy ;)
> Mounted it on a separate server and wrote to it (usual dd if=/dev/zero
> ... bs=1M... )
>
> My top performance reached 87 MB/s
Writes through FUSE are limited to 128 kB. So a single write() request
processed by the OSD will never contain more than 128 kB.
Let's calculate the total duration of one write() operation given the 87
MB/s throughput:

duration = size per write / throughput
         = 128 kB / (87 * 1024 kB/s) = ~0.0014 s

So your system is able to process a write request within 1.4 ms - that's
pretty fast. Since you did not mention it, I assume you did not
enable asynchronous writes? Enable them by setting the mount option
"--max-writeahead=1310720".
By default the client executes the write()s synchronously, so it won't
respond to the application until the write() was acknowledged by the
OSD. Consequently, the duration of the write operation defines the
maximum throughput.
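(A quick back-of-the-envelope, assuming the 1.4 ms per request roughly holds
with several writes in flight - it won't forever, so treat this as an upper
bound: the default of 10 pending asynchronous writes would allow up to

max throughput = pending writes * size per write / duration
               = 10 * 128 kB / 0.0014 s = ~900 MB/s

before some other bottleneck takes over.)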
This is also true if striping is used: By default there are no parallel
writes since the client cannot detect that dd is currently writing 1MB
chunks. It only sees the 128 kB write chunks received from FUSE and is
not aware it would be allowed to write 1 MB in parallel. So, if you want
to leverage striping, you'll always have to set the --max-writeahead to
"stripe width * stripe size".
You wrote that a single OSD could write with > 1GB/s to disk. Even if
you don't use striping, you'll reach a higher throughput with enabled
asynchronous writes (see also my mailing list posting on the performance
subject).
Please note, that the current asynchronous write feature does not
support retry requests. It also contains a bug which can crash the
client with a segfault. I will fix the latter one next week.
Since you want to go for optimal performance, you'll sooner or later
have to increase the default stripe size from 128 kB to higher values,
e.g., 1 MB. But keep in mind that FUSE chops every write() into 128 kB
chunks, so you'll definitely have to switch over to the libxtreemfs then.
Regards,
Michael
Ok, this setting alone raised the performance to 270 MB/s.
> By default the client executes the write()s synchronously, so it won't
> respond to the application until the write() was acknowledged by the OSD.
> Consequently, the duration of the write operation defines the maximum
> throughput.
>
> This is also true if striping is used: By default there are no parallel
> writes since the client cannot detect that dd is currently writing 1MB
> chunks. It only sees the 128 kB write chunks received from FUSE and is not
> aware it would be allowed to write 1 MB in parallel. So, if you want to
> leverage striping, you'll always have to set the --max-writeahead to "stripe
> width * stripe size".
>
> You wrote that a single OSD could write with > 1GB/s to disk. Even if you
> don't use striping, you'll reach a higher throughput with enabled
> asynchronous writes (see also my mailing list posting on the performance
> subject).
>
> Please note, that the current asynchronous write feature does not support
> retry requests. It also contains a bug which can crash the client with a
> segfault. I will fix the latter one next week.
Understood. I'll be ready to rebuild my RPMs ;)
> Since you want to go for optimal performance, you'll sooner or later have to
> increase the default stripe size from 128 kB to higher values, e.g., 1 MB.
> But keep in mind that FUSE chops every write() into 128 kB chunks, so you'll
> definitely have to switch over to the libxtreemfs then.
Ok, how can I use libxtreemfs? Do we have to write a client in order
for it to be used?
This puts 10 dd's in the background, each writing ~1 GB:
n=0;
while (( n < 10 )); do
dd if=/dev/zero of=file$n bs=1M count=1k &
n=$(( n + 1 ));
done
Just wanted to report > 50 MB/s on each process. This is really cool.
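If I want a single aggregate number instead of the per-process rates, I figure
something like this should do (untested sketch; 10 GiB total divided by the
elapsed time):

time (
    for n in $(seq 0 9); do
        dd if=/dev/zero of=file$n bs=1M count=1k &
    done
    wait
)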
> Ok, how can I use libxtreemfs? Do we have to write a client in order
> for it to be used?
The interface to it is defined in the three files client.h, volume.h and
file_handle.h [1].
The current CMake specification generates only a static libxtreemfs.
Therefore I suggest starting by modifying the code of
"example_libxtreemfs.cpp" [2]. You can easily rewrite it to implement
your own "dd" using the libxtreemfs.
Here's how to compile and run it:
0. Install required libraries and dev(el) packages
In particular, you need boost, libssl-dev and libfuse-dev.
Also install "cmake" which generates the build system.
1. Check out the XtreemFS trunk:
svn checkout http://xtreemfs.googlecode.com/svn/trunk/ xtreemfs-read-only
2. Compile the client. Currently you can only compile everything or nothing,
so the FUSE client mount.xtreemfs and the binary "example_libxtreemfs" will
be compiled automatically, too.
cd xtreemfs-read-only
make client_debug
(same as make client but with debug symbols enabled)
3. Run the example_libxtreemfs
cd cpp/build
./example_libxtreemfs
4. Subsequent modifications
Now modify cpp/src/example_libxtreemfs/example_libxtreemfs.cpp to suit
your needs. In order to leverage all cores of your machine, you'll
probably have to use threads, too. Since we're already using boost, you
could use boost::thread for that purpose.
"make client_debug" was only needed the first time for letting CMake
generate the Makefiles to build everything. From now on, just run "make"
inside "cpp/build" and it will compile the changed code.
Regards,
Michael
[1]
http://code.google.com/p/xtreemfs/source/browse/#svn%2Ftrunk%2Fcpp%2Finclude%2Flibxtreemfs
[2]
http://code.google.com/p/xtreemfs/source/browse/trunk/cpp/src/example_libxtreemfs/example_libxtreemfs.cpp
As helpful as always. Thanks a lot. ;=)
How does it look in "top"? Sounds like FUSE uses more threads if you run
multiple dd's than when running a single dd (even though
asynchronous writes are enabled).
If you don't see all threads, run "top -H".
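To watch just the client's threads, something like this should work, too
(assuming procps top and pgrep):

top -H -p "$(pgrep -d, -f mount.xtreemfs)"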
Michael
Here you go.
I am using:
# big writes: http://old.nabble.com/Status-of-write()-block-size-limit-td25084815.html
mount.xtreemfs --fuse_option big_writes phost36.lvs.cloudsigma.com/test test
And doing something like:
n=0; while (( n < 10 )); do dd if=/dev/zero of=file$n bs=1M count=1k &
n=$(( n + 1 )); done;
On 10 OSDs + 1 MRC + 1 DIR.
The mount point is on another server; same net.
The above command gives me a constant ~50 MB/s on every process. If I
climb up to 20 processes, it gets cut in half; and 5 gives me ~65 MB/s.
This is very interesting, indeed. I will do a stress test with virtual
machines: 10 libvirt + KVM/qemu guests on every host, each writing 1 GB at
different block sizes, and see what happens. Let's see how the filesystem
responds to this.
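Inside each guest I'm thinking of something along these lines (just a sketch;
~1 GiB per file at a few block sizes):

dd if=/dev/zero of=test_4k   bs=4k   count=262144
dd if=/dev/zero of=test_128k bs=128k count=8192
dd if=/dev/zero of=test_1M   bs=1M   count=1024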
I'll get back with results. Feel free to suggest anything here.
On 02/25/2012 01:54 AM, Renich Bon Ciric wrote:
> Another interesting thing.
>
> I am using:
>
> # big writes: http://old.nabble.com/Status-of-write()-block-size-limit-td25084815.html
> mount.xtreemfs --fuse_option big_writes phost36.lvs.cloudsigma.com/test test
The client automatically enables "big_writes" if it's available on
your system.
So specifying this option shouldn't have made a difference ;-)
> And doing something like:
> n=0; while (( n < 10 )); do dd if=/dev/zero of=file$n bs=1M count=1k &
> n=$(( n + 1 )); done;
>
> On 10 OSDs + 1 MRC + 1 DIR.
>
> The mount point is on another server; same net.
>
> The above command gives me a constant ~50 MB/s on every process. If I
> climb up to 20 processes, it gets cut in half; and 5 gives me ~65 MB/s.
Can you please also have a look at the performance of a single dd? Try
to increase the --max-writeahead (for instance in multiples of 10 *
128 * 1024, since 10 is your stripe width). Please note that you'll
also have to specify the parameter --max-writeahead-requests then, since
you will be deviating from the default of 10 allowed pending write()
operations.
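For example, to allow 20 pending 128 kB writes (the DIR host, volume and
mount point are placeholders again):

mount.xtreemfs --max-writeahead=$(( 20 * 128 * 1024 )) \
    --max-writeahead-requests=20 \
    your.dir.host/yourvolume /mnt/xtreemfs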
> This is very interesting, indeed. I will do a stress test with virtual
> machines: 10 libvirt + KVM/qemu guests on every host, each writing 1 GB at
> different block sizes, and see what happens. Let's see how the filesystem
> responds to this.
In the screenshot you supplied, you can see that one
mount.xtreemfs thread was busier than the rest and close to 100%. That's
probably the internal network thread which parses the responses of the
servers. If you hit a limit there, you'll probably have to spread the
load across several mount.xtreemfs instances.
Regards,
Michael
Ok, I'll do this test this week. I need to get some results back on
XtreemFS. Thank you for the pointers ;)
> In the screenshot you supplied, you can see that one mount.xtreemfs
> thread was busier than the rest and close to 100%. That's probably the
> internal network thread which parses the responses of the servers. If you
> hit a limit there, you'll probably have to spread the load across several
> mount.xtreemfs instances.
Ok, this sounds interesting. What you're saying is, mount the volume
several times and write to each mount independently?
Yes.
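Something like this (the volume address and mount points are placeholders), and
then point each writer at a different mount point:

for i in 1 2 3 4; do
    mkdir -p /mnt/xtreemfs$i
    mount.xtreemfs your.dir.host/yourvolume /mnt/xtreemfs$i
done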
Michael
Oh sweet irony, our numbers from the paper were done with the Java client. Currently you will probably only get maximum performance when using the Java or C++ client library. The FUSE client's write cache is not complete yet, so you're stuck with FUSE's 128k write splits. So depending on your application, the advice will be to use libxtreemfs directly and bypass the kernel.

Michael has more details; here's some advice in that direction: ...

Felix