Compression of files in the database


NS du Toit

Aug 6, 2013, 8:11:43 AM
to ctjug...@googlegroups.com
Hi,

Does anyone have any experience with compressing files stored in the database?  If you need to store a file inside the database, do you store it compressed or not?  I'm used to leaving files uncompressed, but I want to think it through a bit more in case that's not the best approach (e.g. it was implemented that way because someone didn't have time to investigate compression).

I guess it's more CPU-intensive to compress and decompress all the time, but then at least there is less to write to the HDD (and from what I hear, HDDs are usually the performance bottleneck since they operate synchronously, versus e.g. data cached in memory, which can be accessed by different threads at the same time).

Is this the general tradeoff, or should I just leave all binaries uncompressed?  (Or alternatively, implement it with a "type" column so that I can migrate between the two, and between different compression formats, at will.)
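
Something like the following is what I have in mind for the write path: each row records how its bytes were encoded, so the format can be changed per row later. (Just a rough sketch; the type codes and names are illustrative.)

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.GZIPOutputStream;

    public final class BlobWriter {
        public static final int TYPE_RAW = 0;   // stored as-is
        public static final int TYPE_GZIP = 1;  // gzip-compressed

        // Returns the bytes to store; the caller writes `type` to the "type"
        // column, so rows can be re-encoded (migrated) one at a time later.
        public static byte[] encode(int type, byte[] raw) throws IOException {
            if (type == TYPE_RAW) {
                return raw;
            }
            if (type == TYPE_GZIP) {
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                    gz.write(raw);
                }
                return bos.toByteArray();
            }
            throw new IllegalArgumentException("Unknown compression type: " + type);
        }
    }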

Also what specific compression implementation do you use?

Many thanks :)
Serdyn du Toit

Bevan Williams

Aug 6, 2013, 8:19:55 AM
to ctjug...@googlegroups.com
Hi

This is a great question. I am dealing with something similar and it would be great to know what others out there are using.

For my problem, we are storing many chunks of XML monitoring data in a MySQL database (recently switched engine from legacy MyISAM to InnoDB). As volume on our system has increased, the disk space needed has been growing at a ridiculous rate.

We recently started experimenting with the TokuDB engine with its high-compression option enabled on the necessary tables. The results so far seem quite favourable, and we have not noticed much of a performance hit at all; for our requirements, performance is not critical for this table. It is quite easy to set up, and to migrate existing MyISAM or InnoDB tables to TokuDB with compression.

Hope this helps in some way or another. Hope even more that others on here have more to add.

Regards
Bevan Williams




Gary Jacobson

Aug 6, 2013, 9:05:50 AM
to ctjug...@googlegroups.com
Hi Serdyn

I personally never store files in the database - I store them on the filesystem and store the path in the database. Obviously your needs may be different from mine, but my reasoning is:

- Move load from the database server to the operating system.
- Make it easier for a developer to access the files or perform incremental backups.
- Reduce database size.

Regarding compression, I've used java.util.zip.GZIPOutputStream and it seems pretty efficient.
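
For reference, here is roughly how I use it - a minimal compress/decompress pair using only the standard java.util.zip streams (a sketch, not production code):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    public final class Gzip {
        // Compresses a byte array with gzip.
        public static byte[] compress(byte[] raw) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(raw);
            }
            return bos.toByteArray();
        }

        // Inflates a gzip-compressed byte array back to the original bytes.
        public static byte[] decompress(byte[] compressed) throws IOException {
            try (GZIPInputStream gz = new GZIPInputStream(
                    new ByteArrayInputStream(compressed))) {
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = gz.read(buf)) != -1) {
                    bos.write(buf, 0, n);
                }
                return bos.toByteArray();
            }
        }
    }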

Cheers
Gary



Craig Mason-Jones

Aug 6, 2013, 9:09:48 AM
to ctjug...@googlegroups.com
Hi Gary,

The BIG downside of storing files in the OS is that you can't shard/scale your server unless you copy the files as well. Fine if you've got a single webserver, but you start having to do a whole lot of extra work once you want two or more webservers. Of course, it all depends on the application. Perhaps a neater alternative is to store files on S3 or something similar. Then you can also use the CDN features to get faster delivery without having to worry about this yourself (although it all comes with a cost).
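
A rough sketch of what the upload side looks like with the AWS SDK for Java (the bucket name, key scheme and credentials are placeholders, not from a real setup):

    import java.io.File;
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.PutObjectRequest;

    public class S3Upload {
        public static void main(String[] args) {
            AmazonS3 s3 = new AmazonS3Client(
                    new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
            File file = new File(args[0]);
            // Every webserver talks to the same bucket, so there is nothing
            // to copy across machines when you add a second app server.
            s3.putObject(new PutObjectRequest("my-bucket",
                    "files/" + file.getName(), file));
        }
    }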

Just my 2c worth :-)

C

Moandji Ezana

Aug 6, 2013, 9:46:13 AM
to ctjug...@googlegroups.com

On Tue, Aug 6, 2013 at 3:09 PM, Craig Mason-Jones <cr...@lateral.co.za> wrote:
> The BIG downside of storing files in the OS is that you can't shard/scale your server unless you then proceed to copy the files as well.

Why would you have to do that? Couldn't you have a file server accessed by multiple web servers in the same way you would with a cache server, a DB server, etc?

Chris Ritchie

Aug 6, 2013, 9:58:38 AM
to ctjug...@googlegroups.com
I too would choose not to store files in the database; rather, just save a reference to their location in the database. You can cache your resources using Apache Jackrabbit if you are worried about scalability.

Disk storage is cheap, and disk I/O performs better than hitting the database for each request to serve up a static file.



David Tinker

Aug 7, 2013, 2:19:07 AM
to ctjug...@googlegroups.com
We store compressed data in the db where possible (e.g. when the data is text and does not need to be searched). I think the CPU time spent decompressing the data on read will generally be less than the IO time required to fetch more disk blocks, and the compressed data will use less buffer space on the db server. Also, it's easy to add more app servers if you need more CPU for the decompression, but harder to scale the db.
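
A rough JDBC sketch of that pattern, assuming a hypothetical document(id, body BLOB) table and a Gzip compress/decompress helper like the one sketched earlier in the thread; the db only ever sees (and buffers) the smaller compressed bytes:

    import java.nio.charset.StandardCharsets;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class CompressedDocs {
        // Compress on the app server before INSERT.
        static void save(Connection con, long id, String text) throws Exception {
            byte[] compressed = Gzip.compress(text.getBytes(StandardCharsets.UTF_8));
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO document (id, body) VALUES (?, ?)")) {
                ps.setLong(1, id);
                ps.setBytes(2, compressed);
                ps.executeUpdate();
            }
        }

        // Decompress on the app server after SELECT.
        static String load(Connection con, long id) throws Exception {
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT body FROM document WHERE id = ?")) {
                ps.setLong(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    rs.next();
                    return new String(Gzip.decompress(rs.getBytes(1)),
                            StandardCharsets.UTF_8);
                }
            }
        }
    }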



--
http://qdb.io/ Persistent Message Queues With Replay and #RabbitMQ Integration

Chris Ritchie

Aug 7, 2013, 3:07:44 AM
to ctjug...@googlegroups.com
Research by Microsoft (the 2006 "To BLOB or Not To BLOB" paper) states:

"objects smaller than 256K are best stored in a database while objects larger than 1M are best stored in the filesystem. Between 256K and 1M, the read:write ratio and rate of object overwrite or replacement are important factors."


The only real benefit I see from storing files in the database is ACID consistency.

If you choose to store your files in the database, you can expect the following issues:

- Possible lack of portability between database vendors, depending on the system-specific features you use
- Increased complexity of serving up files
- Inability to take advantage of highly redundant cloud storage
- Decreased performance on files greater than 256K
- Increased database size
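
For illustration, the rule of thumb quoted above could be applied with a simple router like this sketch (how to treat the 256K-1M middle band is an assumption; the paper leaves it to the read:write ratio and overwrite rate):

    public final class BlobRouter {
        private static final long DB_MAX = 256 * 1024;     // <= 256K: database wins
        private static final long FILE_MIN = 1024 * 1024;  // >= 1M: filesystem wins

        enum Store { DATABASE, FILESYSTEM }

        static Store choose(long sizeBytes, boolean overwriteHeavy) {
            if (sizeBytes <= DB_MAX) return Store.DATABASE;
            if (sizeBytes >= FILE_MIN) return Store.FILESYSTEM;
            // In between, overwrite-heavy data tends to favour the database,
            // since filesystems fragment under frequent overwrites.
            return overwriteHeavy ? Store.DATABASE : Store.FILESYSTEM;
        }
    }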

Chris Ritchie

Aug 7, 2013, 3:56:01 AM
to ctjug...@googlegroups.com
Of course, these stats are only strictly true if you are using Microsoft products. I think it is fair to say that if you are using MySQL you will experience reduced database performance compared with MSSQL, and if you are using Linux filesystems you can expect slower read/write performance on larger files compared with NTFS.

NS du Toit

Aug 7, 2013, 6:33:56 AM
to ctjug...@googlegroups.com
Hi,

Thanks for all the feedback - much appreciated.

@Bevan:
> For my problem, we are storing many chunks of XML monitoring data in a MySQL database (recently switched 
> engine from legacy use of MyISAM to InnoDb). As volume on our system has increased, the disk volume
> needed is growing at a ridiculous rate. 
Partitioned?  E.g. a database table per month of data?  Then you can just throw away old tables when they are no longer required.


@Gary
> Regarding compression, I've used java.util.zip.GZIPOutputStream and it seems pretty efficient.
Thanks, will have a look :)


@All
I'm thinking of storing the files in the database for now.  I will, however, store them in a separate database so that I can move it to its own machine later if required.

If you view a database as just a cute API over the filesystem, then this is basically storing the files on the filesystem anyway.  Moving them onto the filesystem directly would only be worth it if the throughput lost to the database layer (the API) turns out to be large enough to justify a different approach.

Storing the files in the database will therefore probably reduce throughput somewhat, but at least I can have a single backup strategy for now.


@craigmj
> The BIG downside of storing files in the OS is that you can't shard/scale your server unless you then
> proceed to copy the files as well.
Good point.  But I guess if the path is stored in the database, then that serves as a type of lookup table and one can migrate the files one at a time.  But yeah, a lot of things need to happen each time the architecture changes (which I think was the point you were trying to make).


@David
> We store compressed data in the db where possible (e.g. when the data 
> is text and does not need to be searched). I think the CPU time spent 
> decompressing the data on read will generally be less than the IO time 
> required to fetch more disk blocks + the data will use less buffer 
> space on the db server.  Also its easy to add more app servers if you 
> need more CPU for the decompression and harder to scale the db.
Thanks, my thinking as well.  "When the data is text and does not need to be searched" is a good point.  Even though I'm sticking with the database for now, I acknowledge there isn't really a big functional benefit to having files in the database rather than on the filesystem; it's just for easy scalability and management for now, and most people will already have a database API written, so nothing new needs to be done.


Many thanks all - appreciated :)
Serdyn du Toit

Craig Newton

Aug 7, 2013, 7:35:05 AM
to ctjug...@googlegroups.com
Hi Serdyn,

I wouldn't recommend storing the files in the database, as it grows very quickly and can start to clutter up the database and make it run extremely slowly. An alternative is to use the filesystem with a GFS2 shared cluster for storage; see http://en.wikipedia.org/wiki/GFS2. It allows all nodes direct concurrent access to the same shared block storage, and all nodes in a GFS2 cluster function as peers. Best of all, the software is free and GPL-licensed.

I would recommend storing only the metadata and the path to the file in the database. Believe me, backups on the database side will become a headache. Using GFS2 is much easier, and there are a few good backup strategies that can be implemented.
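
A minimal sketch of that layout, assuming a hypothetical /mnt/gfs2 mount and a file_meta table (names are illustrative): the bytes go to the shared filesystem, and only metadata plus the path go to the database.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.UUID;

    public class SharedStorage {
        static String store(Connection con, String name, byte[] bytes) throws Exception {
            Path dir = Paths.get("/mnt/gfs2/files");
            Files.createDirectories(dir);
            // A generated name avoids collisions; all cluster nodes see the same mount.
            Path target = dir.resolve(UUID.randomUUID().toString());
            Files.write(target, bytes);
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO file_meta (name, path, size_bytes) VALUES (?, ?, ?)")) {
                ps.setString(1, name);
                ps.setString(2, target.toString());
                ps.setLong(3, (long) bytes.length);
                ps.executeUpdate();
            }
            return target.toString();
        }
    }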

Hope this helps.

Best Regards,
Craig Newton


NS du Toit

Aug 8, 2013, 5:13:33 AM
to ctjug...@googlegroups.com
Hi Craig,

>  It allows all nodes to have direct concurrent access to the same shared block storage.
Good point.  Okay, I'm convinced :) 

I guess for a system with few files and rare file access, the database approach should still be okay though.  And even if a few files become heavily used, one can just cache them in memory and still get concurrent access.
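
Something like this toy read-through cache is what I mean; no eviction, so it is only fine while the hot set stays small (the FileStore interface is just an illustrative stand-in for the database lookup):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public final class FileCache {
        public interface FileStore { byte[] load(long id); }

        private final ConcurrentMap<Long, byte[]> cache =
                new ConcurrentHashMap<Long, byte[]>();
        private final FileStore store;

        public FileCache(FileStore store) { this.store = store; }

        public byte[] get(long id) {
            byte[] bytes = cache.get(id);
            if (bytes == null) {
                bytes = store.load(id);
                // putIfAbsent keeps concurrent readers consistent without locking.
                byte[] prev = cache.putIfAbsent(id, bytes);
                if (prev != null) bytes = prev;
            }
            return bytes;
        }
    }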

But yeah, I want to write software that scales.  Will definitely have a look at GFS2, thanks :)
Serdyn du Toit