Mogile FS Compression


Srinivasan Kidambi

May 23, 2013, 1:40:59 AM
to mog...@googlegroups.com
Hi,

I'd like to know whether the MogileFS server supports inline compression (compressing on the server after it receives raw data from the client). mogtool supports compression parameters:

http://search.cpan.org/dist/MogileFS-Utils/mogtool

but I don't know whether it's done on the MogileFS server side or on the client machine. Any insight on this would be great! Also, the Perl client library does not mention any parameters for compressing the file:

http://search.cpan.org/~dormando/MogileFS-Client-1.16/lib/MogileFS/Client.pm

Does the perl client library support storing files in compressed form and retrieving them?

Thank you all in advance for your help.

Eric Wong

May 23, 2013, 2:03:49 AM
to mog...@googlegroups.com
Srinivasan Kidambi <kidambi...@gmail.com> wrote:
> Hi,
>
> I'm interested to know if Mogile FS Server supports inline compression
> (compress in server after getting raw data from client). The mogtool
> supports compression parameters
>
> http://search.cpan.org/dist/MogileFS-Utils/mogtool

The server will store and serve any data as-is.

mogtool itself is deprecated as it contains some buggy/half-implemented
(mis)features.

> but dont know if its done on the MogileFS Server side or on the client
> machine. Any insight on this would be great! Also the perl Client library
> does not mention about parameters to send compressing the file
>
> http://search.cpan.org/~dormando/MogileFS-Client-1.16/lib/MogileFS/Client.pm
>
> Does the perl client library support storing files in compressed form and
> retrieving them?

The Perl client sends and reads exactly the data you give it, without
transforming it.

It's best for your users and application to decide how to best compress
the data.

One of the original use cases for MogileFS was storing compressed images
for web clients (e.g. JPEG, GIF). This way, the compression is
end-to-end and saves network bandwidth all the way to the web client
(which is probably on a slow link and benefits most from compression).

I always store compressed data in MogileFS. I just pick the appropriate
compression depending on the data I have (e.g. FLAC for audio archival,
Vorbis/Opus for audio streaming, gzip for plain-text, ...).
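
As a rough illustration of compressing on the client side, here is a minimal Ruby sketch that uses only the standard library's Zlib. The helper names pack_for_store and unpack_after_fetch are made up for this example. The compressed string is what would be handed to the client's store call (store_content in the Ruby client, for instance), and the server never needs to know that it is gzip.

```ruby
require 'zlib'

# Hypothetical helper: gzip a string before handing it to the MogileFS client.
def pack_for_store(raw)
  Zlib.gzip(raw)
end

# Hypothetical helper: undo pack_for_store on data read back from MogileFS.
def unpack_after_fetch(stored)
  Zlib.gunzip(stored)
end

text = "plain-text log line\n" * 1000
stored = pack_for_store(text)          # this is what would be uploaded
restored = unpack_after_fetch(stored)  # this is what the reader recovers
puts "raw=#{text.bytesize} stored=#{stored.bytesize} roundtrip_ok=#{restored == text}"
```

Already-compressed formats such as JPEG or Vorbis would normally be stored as they are; a wrapper like this only makes sense for data like plain text.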

Srinivasan Kidambi

May 23, 2013, 4:23:23 AM
to mog...@googlegroups.com
Thanks a lot for your quick reply!

So the client decides whether to compress and which compression algorithm to use. I asked because in my case the client runs on a low-end CPU and decompression causes CPU spikes, so I wanted to know whether it could be offloaded to the server. Thanks for clarifying this.

Also, could I programmatically create new domains using the Perl library MogileFS::Client?
How do I detect whether a domain is present from a client Perl script?
Is it good to have millions of files spread across a few hundred domains, or is it better to have all files under one domain?

Thanks again in advance for your help.

Eric Wong

May 23, 2013, 12:52:33 PM
to mog...@googlegroups.com
Srinivasan Kidambi <kidambi...@gmail.com> wrote:
> Also, Could I programmatically create new domains using the perl library
> MogileFS::Client?

MogileFS::Admin lets you do it (part of the same CPAN package)

> How do I detect if domain is present or not from a client perl script?

MogileFS::Admin->get_domains returns a hashref of all domains/classes
which you can check against.

From 2.66 onwards, you can safely create a client for a non-existent
domain, try to use it, and recover when you get "unreg_domain" errors.
(This triggered a bug in 2.65 and earlier versions, though).
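
The check-then-create idea can be sketched in Ruby as follows. FakeAdmin is an in-memory stand-in so the example runs without a tracker, its data is invented, and ensure_domain is a made-up helper. A real admin object would be expected to offer get_domains and create_domain calls, but the exact return shape is an assumption here.

```ruby
# In-memory stand-in for an admin connection. Only the two calls used below
# are imitated, and the stored values are invented for the example.
class FakeAdmin
  def initialize
    @domains = { 'images' => { 'default' => 2 } }
  end

  def get_domains
    @domains
  end

  def create_domain(name)
    @domains[name] = { 'default' => 2 }
  end
end

# Hypothetical helper: create the domain only if it is not registered yet.
# Returns true if it had to be created.
def ensure_domain(admin, name)
  return false if admin.get_domains.key?(name)
  admin.create_domain(name)
  true
end

admin = FakeAdmin.new
puts ensure_domain(admin, 'images')  # false, already present
puts ensure_domain(admin, 'audio')   # true, created now
puts ensure_domain(admin, 'audio')   # false on the second call
```

With several clients doing this at once, two of them may both see the domain as missing, so the create call should tolerate an "already exists" style failure.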

> Is it good to have millions of files spread across a few hundred domains or
> better to have all files under 1 domain?

It depends on how you want to use the files. If you want to use
the files within the same app, keeping them under one domain will
simplify your code. Likewise, renaming files currently only works
within the same domain.

However if you want to easily find/exclude files of a certain
type, then having the files in different domains can be faster.

Srinivasan Kidambi

May 23, 2013, 5:19:18 PM
to mog...@googlegroups.com
Thank you again for your help.

I use MogileFS::Admin to create domains and classes as I need them. I would prefer multiple domains so that I can find files by domain. Would I be able to delete all files from a given domain (it may contain millions of files) using a single command? If so, how do I do this programmatically and via the command line (probably using mogadm)?

Also, is it possible to reuse a class for multiple domains? The classes for all the domains are going to have the same replication policy, so if I could reuse the same class across domains, I could avoid creating duplicate classes with the same names and policies in the class table.

Dave Lambley

May 24, 2013, 3:45:37 AM
to mog...@googlegroups.com
If you don't mind doing the HTTP upload yourself, you can perform the compression while you're uploading.  That may limit your CPU spiking and disc IO.  I have some old (no longer in CPAN) code here,

https://github.com/davel/MogileFS-Client-Async/blob/b76c233a690278b0c2d2c45ede32db23290d2d37/lib/MogileFS/Client/Callback.pm

This lets you supply the data to be uploaded as it is generated.  It's not quite ready for your application as it needs to know the length of the data in advance, but could trivially be patched to send a chunked PUT request.  Your upload routine could then do something like,

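open(my $fh, "-|", "bzip2", "-zc", "--", $file) or die $!;
# method name as in MogileFS::Client::Callback; check it against your copy
my $cb = $mogile->store_file_from_callback($key, $class);

# feed the compressed data to the uploader 4 KiB at a time
while (sysread($fh, my $data, 4096) > 0) {
    $cb->($data, 0);
}
$cb->("", 1);   # signal end of data
close $fh;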

You would then need to do something similar when downloading.
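
For the download side, a comparable idea is to decompress chunk by chunk so that the whole file is never held in memory. The sketch below is in Ruby and swaps in gzip (from the standard library's Zlib) for bzip2. StringIO objects stand in for the HTTP response body and the destination file, and gunzip_stream is a made-up name.

```ruby
require 'zlib'
require 'stringio'

CHUNK = 4096

# Hypothetical helper: read gzip data from `src` and write the decompressed
# bytes to `dst`, at most CHUNK decompressed bytes at a time.
def gunzip_stream(src, dst)
  gz = Zlib::GzipReader.new(src)
  total = 0
  while (chunk = gz.read(CHUNK))
    dst.write(chunk)
    total += chunk.bytesize
  end
  gz.finish # end the gzip stream without closing `src`
  total
end

original = "line of text\n" * 10_000
src = StringIO.new(Zlib.gzip(original)) # stand-in for the downloaded body
dst = StringIO.new(String.new)          # stand-in for the output file
bytes = gunzip_stream(src, dst)
puts "wrote #{bytes} bytes, match=#{dst.string == original}"
```

Any IO-like object with a read method should work as src, so the same loop could in principle be fed from a socket or a temporary file.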

Hope this helps.

Dave


Eric Wong

May 24, 2013, 3:37:11 PM
to mog...@googlegroups.com
Srinivasan Kidambi <kidambi...@gmail.com> wrote:
> Thank you again for your help.
>
> I use MogileFS::Admin to create domains and classes as I need them. I would
> prefer multiple domains to find files by the domain. Would I be able to
> delete all files from a given domain (it may contain millions of files)
> using a single command. If so, how do I do this programmatically and via
> command line (probably using mogadm).

You can string together moglistkeys and mogdelete via xargs:

# will break if you have odd key names, use at your own risk
moglistkeys -d $D --key_prefix foo | xargs -n1 mogdelete -t -d $D --key

If you have odd key names which require shell quoting, it won't work, and
it's also slow and inefficient for many keys.

It's probably better to write some perl to do it (reusing the same
connections/processes).

# untested, use at your own risk
$mogc->foreach_key(prefix => 'foo', sub {
    my ($key) = @_;
    $mogc->delete($key);
});

> Also, is it possible to reuse a class for multiple domains, as the classes
> for all the domains are going to have the same replication policy. So if I
> could reuse the same class across domains, then I could avoid duplicate
> classes with the same names and policies in the class table unnecessarily
> created.

No, it's not possible. It would require large, incompatible database/code
changes.

Srinivasan Kidambi

Jun 4, 2013, 2:50:20 AM
to mog...@googlegroups.com
Hi,

Thanks for the information. I'd like to know whether I can bulk insert a bunch of files into MogileFS. I have a distributed system with a lot of machines trying to write files to MogileFS simultaneously, and I find the response to be quite slow. Are there any speed optimizations or configuration settings that can help me with this?

In my case, the distributed machines write to an in-memory system, and a set of deferred write workers flush the in-memory data to MogileFS. Since the writes are quite slow, this leads to full-memory situations. Would bulk insert help with this? Earlier I was using another DB where speed was not an issue, so the data in memory stabilized over time and didn't lead to a full-memory situation.

Thanks again in advance!

Eric Wong

Jun 4, 2013, 5:08:19 AM
to mog...@googlegroups.com
Srinivasan Kidambi <kidambi...@gmail.com> wrote:
> Thanks for the information. I'm interested to know if I can bulk insert a
> bunch of files in MogileFS. I have a distributed system with a lot of
> machines trying to write files to MogileFS simultaneously and I find the
> response to be quite slow. Are there any speed optimizations/configuration
> settings that can help me in this.

You can upload many files in parallel.

How large are your files and how fast is your link? In other words,
I'm trying to narrow down what is causing your slowness.

For small files, persistent HTTP connections help (more as files get
smaller) since there's no need to go through the 3-way TCP handshake.
Persistent connections won't noticeably help if your files are large and
take several seconds/minutes to transfer, though.
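
One way to upload in parallel while keeping memory bounded is a fixed pool of uploader threads fed from a bounded queue. The Ruby sketch below uses only core classes. The store lambda is a stand-in for a real upload call, and the constants and the name run_uploaders are invented for the example. In real use each worker would probably want its own client object, since sharing one across threads is not something this thread confirms to be safe.

```ruby
require 'thread'

QUEUE_LIMIT = 100 # at most this many items wait in memory
WORKERS = 8

# Start `workers` threads that pop [key, data] pairs and pass them to
# `store`, any callable taking (key, data). A nil item stops one worker.
def run_uploaders(queue, store, workers: WORKERS)
  Array.new(workers) do
    Thread.new do
      while (item = queue.pop)
        key, data = item
        store.call(key, data)
      end
    end
  end
end

uploaded = Queue.new # thread-safe collector used only by this demo
store = ->(key, data) { uploaded << [key, data.bytesize] }

queue = SizedQueue.new(QUEUE_LIMIT)
threads = run_uploaders(queue, store)
# push blocks while the queue is full, which throttles the producer
1000.times { |i| queue.push(["key-#{i}", "payload #{i}"]) }
WORKERS.times { queue.push(nil) } # one stop signal per worker
threads.each(&:join)
puts "uploaded #{uploaded.size} items"
```

The blocking push is the point of the design: when uploads fall behind, producers slow down instead of filling memory.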

> In my case, the distributed machines are writing to an in-memory system and
> a set of deferred write workers are clearing up the in-memory data to
> MogileFS. Since the writes are quite slow, this is leading to full memory
> situations. Would bulk-insert help this? Earlier I was using another DB, in
> which the speed was not an issue, so the data in memory stabilized over
> time and didnt lead to full-memory situation.

If you're uploading many large files over a fat pipe to Linux storage
nodes, you may end up with I/O stalls due to excessive buffering before
write-out.

I run my servers (>= 8G of RAM, expecting sustained write I/O) with
vm.dirty_background_ratio=1 and vm.dirty_ratio=2.

If I had even more RAM (and a non-ancient kernel), I'd use something
like vm.dirty_background_bytes=80000000 and vm.dirty_bytes=160000000 to
start background writes at 80M and _block_ writing processes if we hit
160M dirty. This trades fast bursty performance for _consistent_
performance.

Please play around with these values for your hardware. The Linux
defaults are optimized for bursty workloads (and smaller machines).

(See the Linux kernel documentation for more info on these sysctls).
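
To see roughly how the percentage settings relate to byte thresholds, here is a small worked calculation in Ruby. It assumes the percentage is applied to total RAM. The kernel actually applies it to available memory, which is somewhat lower, so the results are best read as upper bounds. dirty_limit_bytes is a made-up helper name.

```ruby
GIB = 1024**3

# Rough upper bound on dirty page cache, in bytes, for a percentage setting.
# Assumption: percentage of total RAM (the kernel uses available memory).
def dirty_limit_bytes(ram_bytes, percent)
  ram_bytes * percent / 100
end

ram = 8 * GIB
background = dirty_limit_bytes(ram, 1) # vm.dirty_background_ratio=1
blocking   = dirty_limit_bytes(ram, 2) # vm.dirty_ratio=2
puts "background writeback starts near #{background / 1_000_000} MB"
puts "writers block near #{blocking / 1_000_000} MB"
```

On an 8G machine this comes out close to the 80M and 160M figures above, which is one reason fixed byte values become attractive once RAM grows and even 1% is a lot of dirty data.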

Back to small files...

Here's a ruby script I sometimes use for torture testing metadata
performance. It's serialized, but uses persistent connections
to speed up small uploads.

# Ruby MogileFS::Client will automatically use NHP if available with
# store_content (since we know the contents already fit in memory)
require 'net/http/persistent' # gem install net-http-persistent

require 'mogilefs' # gem install mogilefs-client
c = MogileFS::MogileFS.new hosts: %w(0:7500), domain: 'test'
begin
  100000.times do |i|
    i = i.to_s
    c.store_content(i, "default", i)
  end
  puts "#{Time.now} done once"
rescue => e
  warn "#{e.message} (#{e.class}) #{Time.now}"
  sleep 1
end while true

Dave Lambley

Jun 4, 2013, 8:48:35 AM
to mog...@googlegroups.com
> Thanks for the information. I'm interested to know if I can bulk insert a bunch of files in MogileFS. I have a distributed system with a lot of machines trying to write files to MogileFS simultaneously and I find the response to be quite slow. Are there any speed optimizations/configuration settings that can help me in this.
>
> In my case, the distributed machines are writing to an in-memory system and a set of deferred write workers are clearing up the in-memory data to MogileFS. Since the writes are quite slow, this is leading to full memory situations. Would bulk-insert help this? Earlier I was using another DB, in which the speed was not an issue, so the data in memory stabilized over time and didnt lead to full-memory situation.

The Perl Mogile client will by default load a file into RAM before uploading.  Passing the option largefile => 1 to store_file will stop it doing this which should reduce RAM usage and may help performance.

I also have an alternate Mogile client, https://metacpan.org/module/MogileFS::Client::CallbackFile intended to give faster write performance with more control over the upload progress.

Dave