rewrite


Renich Bon Ciric

Feb 14, 2012, 2:55:43 AM
to xtre...@googlegroups.com
Would you ever rewrite this in Google Go, C, C++ or some language that doesn't eat all your memory while being slow as hell?

Sorry for the sarcasm; I was all excited about it... until I learned it's written in Java... no offense meant...

Felix Hupfeld

Feb 14, 2012, 3:58:08 AM
to xtre...@googlegroups.com
On 14 February 2012 07:55, Renich Bon Ciric <ren...@woralelandia.com> wrote:
> Would you, ever, rewrite this in google go, c, c++ or some language that doesn't eat all your memory while being slow as hell?

When we find a performance problem with the JVM...

Seriously, memory is Java's bottleneck if you don't approach it consciously. But XtreemFS does very careful memory management, and we haven't found Java or the JVM to be a problem. You might run into issues if you do I/O at bleeding-edge speeds, but that's not what XtreemFS is aiming for, and it is fine doing high-performance I/O.

If we hadn't used Java and user-level drivers, we would still be hacking low-level code with the manpower that was available. Instead we have a full file system with Paxos-based fault-tolerant replication at basically all levels. Compare this with projects that chose other languages.

Renich Bon Ciric

Feb 15, 2012, 4:30:35 AM
to xtre...@googlegroups.com
On Tue, Feb 14, 2012 at 2:58 AM, Felix Hupfeld <fhup...@googlemail.com> wrote:
> When we find a performance problem with the JVM...
>
> Seriously, Java's bottleneck is memory, if you don't approach it
> consciously. But XtreemFS does very careful memory management, and we
> haven't seen Java/JVM to be a problem. You might run into issues if you do
> IO at bleeding edge speeds, but that's not what XtreemFS is aiming for, and
> it is fine doing high performance IO.
>
> If we wouldn't have used Java and do user-level drivers, we would still  be
> hacking low level code with the man power that was available. Instead we
> have a full file system with Paxos-based fault-tolerant replication on
> basically all levels. Compare this with projects that chose other languages.

You are very kind to answer.

The features are incredible. I love them. But speed is very important. We
have a 10 Gbps network, we're planning to set up 800 nodes, and we
need top performance... We're planning on scaling up to 40 Gbps
soon...

Geo-replication sounds awesome too. Striping, we need it (for the cloud)...

All of this, except for the object-based storage, is present in Ceph and
Gluster. I like all three projects... and I think you need to think
seriously about speed.

--
It's hard to be free... but I love to struggle. Love isn't asked for;
it's just given. Respect isn't asked for; it's earned!
Renich Bon Ciric

http://www.woralelandia.com/
http://www.introbella.com/

Felix Hupfeld

Feb 15, 2012, 4:42:55 AM
to xtre...@googlegroups.com
On 15 February 2012 09:30, Renich Bon Ciric <ren...@woralelandia.com> wrote:
On Tue, Feb 14, 2012 at 2:58 AM, Felix Hupfeld <fhup...@googlemail.com> wrote:
> When we find a performance problem with the JVM...
>
> Seriously, Java's bottleneck is memory, if you don't approach it
> consciously. But XtreemFS does very careful memory management, and we
> haven't seen Java/JVM to be a problem. You might run into issues if you do
> IO at bleeding edge speeds, but that's not what XtreemFS is aiming for, and
> it is fine doing high performance IO.
>
> If we wouldn't have used Java and do user-level drivers, we would still  be
> hacking low level code with the man power that was available. Instead we
> have a full file system with Paxos-based fault-tolerant replication on
> basically all levels. Compare this with projects that chose other languages.

You are very kind to answer.

Features are incredible. I love them. But speed is very important. We
have a 10Gbps network and we're planning to setup 800 nodes and we
need top performance... We're planning on scaling up to 40 Gbps
soon...

10 Gbps... come on ;) By bleeding edge I meant huge cluster installations with the latest technology, which usually have to use Lustre and do not go over TCP/IP. You can contest whether what they do is actual file system usage, but still.

We successfully saturated Infiniband in 2008: http://xtreemfs.org/publications/striping-lasco-camera.pdf

Please go ahead and evaluate XtreemFS on your hardware. We are curious where the bottleneck is. Quite likely it will be the network, with some CPU usage, but I guess a machine with a 40 Gbps connection will have enough cores.

Renich Bon Ciric

Feb 16, 2012, 6:31:36 AM
to xtre...@googlegroups.com
On Wed, Feb 15, 2012 at 3:42 AM, Felix Hupfeld <fhup...@googlemail.com> wrote:
> 10 Gbps... come on ;) With bleeding edge I meant huge cluster installations
> with latest technology, which usually have to use Lustre and do not go over
> TCP/IP. You can contest if that is an actual file system usage what they do,
> but still.
>
> We successfully saturated Infiniband in
> 2008: http://xtreemfs.org/publications/striping-lasco-camera.pdf

Ah, so good to read this! We will test it... definitely... If I can
make it run on Fedora, that is ;) (I'm having problems with the init
scripts using sudo and not being able to find Java.)

> Please go ahead and evaluate XtreemFS on your hardware. We are curious where
> the bottleneck is. Quite likely it will be network, with some CPU usage, but
> I guess a 40 Gbps connected machine will have enough cores.

Ok, well, I'll tell you how it went. I'll do some dd and other benchmarks...
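Probably something along these lines (the mount point is just a placeholder):

# sequential write test; conv=fdatasync makes dd flush before reporting,
# so the number reflects actual storage throughput
dd if=/dev/zero of=/mnt/xtreemfs/testfile bs=1M count=10k conv=fdatasync

# sequential read test of the same file
dd if=/mnt/xtreemfs/testfile of=/dev/null bs=1M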

Felix Hupfeld

Feb 16, 2012, 9:00:26 AM
to xtre...@googlegroups.com
On 16 February 2012 11:31, Renich Bon Ciric <ren...@woralelandia.com> wrote:
On Wed, Feb 15, 2012 at 3:42 AM, Felix Hupfeld <fhup...@googlemail.com> wrote:
> 10 Gbps... come on ;) With bleeding edge I meant huge cluster installations
> with latest technology, which usually have to use Lustre and do not go over
> TCP/IP. You can contest if that is an actual file system usage what they do,
> but still.
>
> We successfully saturated Infiniband in
> 2008: http://xtreemfs.org/publications/striping-lasco-camera.pdf

Ah, so good to read this! We will test it... definitely... If I can
make it run on fedora, that is ;) (having problems with init scripts
using sudo; and not being able to find java)

We are eagerly waiting for your report. I guess I do not have to state this explicitly, but keep an eye on the block sizes at all levels (object sizes, write sizes, ...; they need to be large enough) and monitor CPU usage, because that's the cost of maxing out your NICs. The better you document everything, the more valuable it will be. If you need help with performance tuning, please ask.

If you want to max out not only a single node but the whole cluster, be aware of your backplane bandwidth. It might be that it does not do 800 * 10 Gbps.

Renich Bon Ciric

Feb 16, 2012, 3:59:39 PM
to xtre...@googlegroups.com
On Thu, Feb 16, 2012 at 8:00 AM, Felix Hupfeld <fhup...@googlemail.com> wrote:
> We are curiously waiting for your report. I guess I do not have to state
> that explicitly, but keep an eye on block sizes at levels (object sizes,
> write sizes, ... they need a certain size) and monitor CPU usage, because
> that's the cost for maxing out your NICs. The better you document everything
> the more valuable it will be. If you need help in performance tuning, please
> come ask.

You're considerate and kind. Thanks for all the tips and help. I
will try to document things as well as possible.

Phase 1 is a 10 Gbps connection. Phase 2 would be 40 Gbps in our
infrastructure; this will take a bit of time, but it's close.

> If you want to max out not only a single node but the whole cluster, be
> aware of your backplane bandwidth. It might be that it does not do 800 * 10
> Gbps.

Ah, yeah. Our test bed is 11 servers. Once implemented, we will
expand to 800 nodes, with rack interconnections between 40 Gbps and 80
Gbps; I'm not really sure yet.

Anyway, thanks a lot for the offer of help. I'll get back to you when
I'm up and running.

Felix Hupfeld

Feb 17, 2012, 6:53:56 AM
to xtre...@googlegroups.com
On 16 February 2012 20:59, Renich Bon Ciric <ren...@woralelandia.com> wrote:
On Thu, Feb 16, 2012 at 8:00 AM, Felix Hupfeld <fhup...@googlemail.com> wrote:
> We are curiously waiting for your report. I guess I do not have to state
> that explicitly, but keep an eye on block sizes at levels (object sizes,
> write sizes, ... they need a certain size) and monitor CPU usage, because
> that's the cost for maxing out your NICs. The better you document everything
> the more valuable it will be. If you need help in performance tuning, please
> come ask.

You're considerate and kind. Thanks for the overall tips and help. I
will try to document things as best as possible.

Fase 1 is a 10 Gbps connection. Fase two would be 40 Gbps in our
infrastructure and this will take a bit of time; but it's close.

> If you want to max out not only a single node but the whole cluster, be
> aware of your backplane bandwidth. It might be that it does not do 800 * 10
> Gbps.

Ah, yeah. Our testing bed is 11 servers. Once implemented, we will
expand to 800 nodes; with rack interconnections between 40 Gbps and 80
Gbps; not really sure.

Anyway, thanks a lot for the help offering. I'll get back to you when
I'm running.

Oh, sweet irony: our numbers from the paper were obtained with the Java client. Currently you will probably only get maximum performance when using the Java or C++ client library. The FUSE client's write cache is not complete yet, so you're stuck with FUSE's 128 kB write splits. So depending on your application, the advice will be to use libxtreemfs directly and bypass the kernel.

Michael has more details; here's some advice in that direction:

... Felix

Renich Bon Ciric

Feb 24, 2012, 5:06:42 AM
to xtre...@googlegroups.com
On Fri, Feb 17, 2012 at 5:53 AM, Felix Hupfeld <fhup...@googlemail.com> wrote:
> Oh sweet irony, our numbers from the paper were done with the Java client.
> Currently you probably will only get max. performance when using the Java or
> C++ client library. The FUSE client's write cache is not complete yet, and
> so you're stuck with FUSE's 128k write splits. So depending on your
> application, the advise will be to use libxtreemfs directly and bypass the
> kernel.

OK, I did some tests with 10 OSDs, 1 MRC and 1 DIR.

Created a RAID0 volume with -w 10 and -a VOLUME.

Mounted it on a separate server and wrote to it (the usual dd
if=/dev/zero ... bs=1M ...).

My top performance reached 87 MB/s.
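For reference, the commands were roughly these (hostnames are
placeholders and I'm quoting the tool name and flags from memory, so
take them with a grain of salt):

# striped RAID0 volume across 10 OSDs, volume-level access control
mkfs.xtreemfs -w 10 -a VOLUME my-mrc-host/test

# mount on the client box and write to it
mount.xtreemfs my-dir-host/test /mnt/test
dd if=/dev/zero of=/mnt/test/file bs=1M count=10k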

On the other hand, if I write directly to the underlying mount point
(/var/lib/xtreemfs/objs), I can do:

# dd if=/dev/zero of=file bs=1M count=10k
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 7.71371 s, 1.4 GB/s

I also tried sending this file with nc from host to host, and I
could reach 1.2 GB/s.
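The nc test was basically this (port and hostname are made up for the
example; the listen syntax depends on your nc flavor):

# receiving host (traditional netcat; BSD netcat drops the -p)
nc -l -p 9999 > /dev/null

# sending host
dd if=file bs=1M | nc otherhost 9999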

I'll read the link you sent... I'm also considering the libxtreemfs thingy ;)

Michael Berlin

Feb 24, 2012, 5:35:49 AM
to xtre...@googlegroups.com
Hi Renich,

> Mounted it on a separate server and wrote to it (usual dd if=/dev/zero
> ... bs=1M... )
>
> My top performance reached 87 MB/s

Writes through FUSE are limited to 128 kB. So a single write() request
processed by the OSD will never contain more than 128 kB.

Let's calculate the total duration of the write() operation given the 87
MB/s throughput.

duration = size per write / throughput

duration = 128 kB / (87 * 1024 kB/s) = 0.0014 s
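A quick sanity check with bc:

echo "scale=4; 128 / (87 * 1024)" | bc
# prints .0014, i.e. roughly 1.4 ms per 128 kB write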

So your system is able to process a write request within 1.4 ms, which
is pretty fast. Since you did not mention anything, I assume you did
not enable asynchronous writes? Enable them by setting the mount option
"--max-writeahead=1310720".

By default the client executes the write()s synchronously, so it won't
respond to the application until the write() has been acknowledged by
the OSD. Consequently, the duration of the write operation defines the
maximum throughput.

This is also true if striping is used: by default there are no parallel
writes, since the client cannot detect that dd is currently writing
1 MB chunks. It only sees the 128 kB write chunks received from FUSE
and is not aware that it would be allowed to write 1 MB in parallel.
So, if you want to leverage striping, you'll always have to set
--max-writeahead to "stripe width * stripe size".

You wrote that a single OSD could write to disk at > 1 GB/s. Even if
you don't use striping, you'll reach a higher throughput with
asynchronous writes enabled (see also my mailing list posting on the
performance subject).

Please note that the current asynchronous write feature does not
retry failed requests. It also contains a bug which can crash the
client with a segfault. I will fix the latter next week.

Since you want to go for optimal performance, you'll sooner or later
have to increase the default stripe size from 128 kB to higher values,
e.g. 1 MB. But keep in mind that FUSE chops every write() into 128 kB
chunks, so you'll definitely have to switch over to libxtreemfs then.

Regards,
Michael

Renich Bon Ciric

Feb 24, 2012, 5:49:49 AM
to xtre...@googlegroups.com
On Fri, Feb 24, 2012 at 4:35 AM, Michael Berlin
<michael.ber...@googlemail.com> wrote:
> Writes through FUSE are limited to 128 kB. So a single write() request
> processed by the OSD will never contain more than 128 kB.
>
> Let's calculate the total duration of the write() operation given the 87
> MB/s throughput.
>
> duration = size per write / through put
>
> duration = 128 kB / (87 MB/s * 1024) = 0.0014s
>
> So your system is able to process a write request within 1,4 ms - that's
> pretty fast. Since you did not mention anything, I assume you did not enable
> the asynchronous writes? Enable them by setting the mount option
> "--max-writeahead=1310720".

OK, this setting alone raised the performance to 270 MB/s.

> By default the client executes the write()s synchronously, so it won't
> respond to the application until the write() was acknowledged by the OSD.
> Consequently, the duration of the write operation defines the maximum
> throughput.
>
> This is also true if striping is used: By default there are no parallel
> writes since the client cannot detect that dd is currently writing 1MB
> chunks. It only sees the 128 kB write chunks received from FUSE and is not
> aware it would be allowed to write 1 MB in parallel. So, if you want to
> leverage striping, you'll always have to set the --max-writeahead to "stripe
> width * stripe size".
>
> You wrote that a single OSD could write with > 1GB/s to disk. Even if you
> don't use striping, you'll reach a higher throughput with enabled
> asynchronous writes (see also my mailing list posting on the performance
> subject).
>
> Please note, that the current asynchronous write feature does not support
> retry requests. It also contains a bug which can crash the client with a
> segfault. I will fix the latter one next week.

Understood. I'll be ready to rebuild my RPMs ;)

> Since you want to go for optimal performance, you'll sooner or later have to
> increase the default stripe width from 128 kB to higher values e.g., 1 MB.
> But keep in mind that FUSE chops every write() into 128 kB chunks, so you'll
> definitely have to switch over to the libxtreemfs then.

OK, how can I use libxtreemfs? Do we have to write a client in order
to use it?

Renich Bon Ciric

Feb 24, 2012, 5:55:39 AM
to xtre...@googlegroups.com
OK, just a side note.

This puts 10 dd processes in the background, each writing ~1 GB:

n=0;
while (( n < 10 )); do
    dd if=/dev/zero of=file$n bs=1M count=1k &
    n=$(( n + 1 ));
done
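To get the aggregate number instead of the per-process numbers, the
same loop can be wrapped in time with a wait at the end, e.g.:

time ( n=0; while (( n < 10 )); do dd if=/dev/zero of=file$n bs=1M count=1k & n=$(( n + 1 )); done; wait )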

Just wanted to report > 50 MB/s on each process. This is really cool.

Michael Berlin

Feb 24, 2012, 6:19:18 AM
to xtre...@googlegroups.com
Hi Renich,

> Ok, how can I use libxtreemfs? do we have to write a client in order
> for it to be used?

The interface to it is defined in the three files client.h, volume.h and
file_handle.h [1].

The current CMake specification generates only a static libxtreemfs.
Therefore I suggest starting by modifying the code of
"example_libxtreemfs.cpp" [2]. You can easily rewrite it to implement
your own "dd" using libxtreemfs.

Here's how to compile and run it:

0. Install required libraries and dev(el) packages
In particular, you need boost, libssl-dev and libfuse-dev.

Also install "cmake" which generates the build system.
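On Fedora (since that's what you are using), something along these
lines should pull in everything; package names quoted from memory:

yum install gcc-c++ make cmake boost-devel openssl-devel fuse-devel subversion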

1. Check out the XtreemFS trunk:

svn checkout http://xtreemfs.googlecode.com/svn/trunk/ xtreemfs-read-only

2. Compile the client. Currently it's all or nothing, so the FUSE
client mount.xtreemfs and the binary "example_libxtreemfs" will be
compiled automatically as well.

cd xtreemfs-read-only
make client_debug
(same as make client but with debug symbols enabled)

3. Run the example_libxtreemfs

cd cpp/build
./example_libxtreemfs

4. Subsequent modifications

Now modify cpp/src/example_libxtreemfs/example_libxtreemfs.cpp to suit
your needs. In order to leverage all cores of your machine, you'll
probably have to use threads, too. Since we're already using boost, you
could use boost::thread for that purpose.

"make client_debug" was only needed the first time for letting CMake
generate the Makefiles to build everything. From now on, just run "make"
inside "cpp/build" and it will compile the changed code.

Regards,
Michael

[1]
http://code.google.com/p/xtreemfs/source/browse/#svn%2Ftrunk%2Fcpp%2Finclude%2Flibxtreemfs
[2]
http://code.google.com/p/xtreemfs/source/browse/trunk/cpp/src/example_libxtreemfs/example_libxtreemfs.cpp

Renich Bon Ciric

Feb 24, 2012, 6:23:14 AM
to xtre...@googlegroups.com
On Fri, Feb 24, 2012 at 5:19 AM, Michael Berlin
<michael.ber...@googlemail.com> wrote:
> ...

As helpful as always. Thanks a lot. ;=)

Michael Berlin

Feb 24, 2012, 6:23:41 AM
to xtre...@googlegroups.com

How does it look in "top"? It sounds like FUSE uses more threads if you
run multiple dd's compared to running a single dd (although
asynchronous writes are enabled).

If you don't see all threads, run "top -H".

Michael

Renich Bon Ciric

Feb 24, 2012, 6:26:02 AM
to xtre...@googlegroups.com
On Fri, Feb 24, 2012 at 5:23 AM, Michael Berlin
<michael.ber...@googlemail.com> wrote:
> How does it look in "top"? Sounds like FUSE uses more threads if you run
> multiple dd's in comparison to running a single dd (althrough asynchronous
> writes are enabled).
>
> If you don't see all threads, run "top -H".
>
> Michael

Here you go.

Screenshot at 2012-02-24 05:25:04.png

Renich Bon Ciric

Feb 24, 2012, 7:54:12 PM
to xtre...@googlegroups.com
Another interesting thing.

I am using:

# big writes: http://old.nabble.com/Status-of-write()-block-size-limit-td25084815.html
mount.xtreemfs --fuse_option big_writes phost36.lvs.cloudsigma.com/test test

And doing something like:


n=0; while (( n < 10 )); do dd if=/dev/zero of=file$n bs=1M count=1k & n=$(( n + 1 )); done

On 10 OSDs + 1 MRC + 1 DIR.

The mount point is on another server; same net.

The above command consistently gives me ~50 MB/s per process. If I
go up to 20 processes, it gets reduced by half. And 5 gives ~65 MB/s.

This is very interesting, indeed. I will do a stress test with virtual
machines: 10 libvirt + KVM/qemu guests on every host, writing 1 GB with
different block sizes, and see what happens. Let's see how the
filesystem responds to this.

I'll get back with results. Feel free to suggest anything here.

Michael Berlin

Feb 27, 2012, 4:51:48 AM
to xtre...@googlegroups.com
Hi,

On 02/25/2012 01:54 AM, Renich Bon Ciric wrote:
> Another interesting thing.
>
> I am using:
>
> # big writes: http://old.nabble.com/Status-of-write()-block-size-limit-td25084815.html
> mount.xtreemfs --fuse_option big_writes phost36.lvs.cloudsigma.com/test test

The client automatically enables "big_writes" if it's available on
your system.

So specifying this option shouldn't have made a difference ;-)

> And doing something like:
> n=0; while (( n< 10 )); do dd if=/dev/zero of=file$n bs=1M count=1k&
> n=$(( n + 1 )); done;
>
> On 10 OSDs + 1 MRC + 1 DIR.
>
> The mount point is on another server; same net.
>
> The above command gives me, constant, ~50 GB/s on every process. If I
> climb up to 20, it gets reduced by half. And 5 gives ~65 MB/s.

Can you please also have a look at the performance of a single dd? Try
increasing the --max-writeahead (for instance in multiples of 10 *
128 * 1024, since 10 is your striping width). Please note that you'll
also have to specify the parameter --max-writeahead-requests then,
since you will be deviating from the default of 10 allowed pending
write() operations.
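For example, four times that amount would look like this (hostname,
volume and mount point are placeholders):

# 4 * 10 * 128 * 1024 = 5242880 bytes, i.e. up to 40 outstanding 128 kB writes
mount.xtreemfs --max-writeahead=5242880 --max-writeahead-requests=40 my-dir-host/test /mnt/test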

> This is very interesting, indeed. I will do a stress test with virtual
> machines. 10 libvirt + KVM/qemu on every host; writing 1 Gb on
> different bs and see what happens. Let's see how does the filesystem
> respond to this.

In the screenshot you supplied, you can see that one mount.xtreemfs
thread was busier than the rest and close to 100%. That's probably the
internal network thread which parses the responses from the servers. If
you hit a limit there, you'll probably have to spread the load across
several mount.xtreemfs instances.

Regards,
Michael

Renich Bon Ciric

Feb 27, 2012, 5:00:59 AM
to xtre...@googlegroups.com
On Mon, Feb 27, 2012 at 3:51 AM, Michael Berlin
<michael.ber...@googlemail.com> wrote:
> Can you please also have a look at the performance of a single dd? Try to
> increase the --max-writeahead (for instance with multiples of 10 * 128 *
> 1024 since 10 is your striping width). Please notice that you'll also have
> to specify the parameter --max-writeahead-requests then since you will be
> deviating from the default of allowed 10 pending write() operations.

OK, I'll do this test this week. I need to get some results back on
XtreemFS. Thank you for the pointers ;)

> In the screenshot you have supplied, you could see that one mount.xtreemfs
> thread was busier than the rest and close to 100%. That's probably the
> internal network thread which parses the responses of the servers. If you
> hit a limit there, you'll probably have to spread the load across several
> mount.xtreemfs instances.

OK, this sounds interesting. So what you're saying is: mount the volume
several times and write to each mount independently?

Michael Berlin

Feb 27, 2012, 5:09:45 AM
to xtre...@googlegroups.com
> Ok, this sounds interesting. What you're saying is, mount the volume
> several times and write to each mount independently?

Yes.
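Something like this, with each mount point getting its own
mount.xtreemfs process and thus its own network thread (hostname,
volume and paths are placeholders):

mount.xtreemfs --max-writeahead=1310720 my-dir-host/test /mnt/test1
mount.xtreemfs --max-writeahead=1310720 my-dir-host/test /mnt/test2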

Michael

Renich Bon Ciric

May 22, 2012, 11:46:45 AM
to xtre...@googlegroups.com


On Friday, February 17, 2012 5:53:56 AM UTC-6, Felix Hupfeld wrote:
Oh sweet irony, our numbers from the paper were done with the Java client. Currently you probably will only get max. performance when using the Java or C++ client library. The FUSE client's write cache is not complete yet, and so you're stuck with FUSE's 128k write splits. So depending on your application, the advise will be to use libxtreemfs directly and bypass the kernel.

Michael has more details, here's some advice in that direction:

... Felix

Is anybody here "hireable" to write a client for us? Integration with QEMU would be awesome as well. Please contact me if possible.