Storing plenty of small (15 KB) files in MongoDB

Maxim Rusakov

Oct 13, 2011, 2:09:25 AM
to mongodb-user
Hello,

We are developing web search software that scans internet pages and
stores them as zipped HTML (no pictures) for archival purposes.
Until now we have stored everything except the HTML in MySQL. The
average zipped HTML page is around 15-20 KB, and we store the pages as
plain Linux files.

Now we are migrating to MongoDB.

My questions are:
* Is it a good idea to store the HTML as byte[] in a separate collection
inside the MongoDB database, or is it better to keep using files and
store only the path to each file?
* How much overhead (RAM, disk space) will MongoDB add compared with
storing the pages in the file system?

The volume is:
* 10 million pages per year
* each page is 15-20 KB
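
For a sense of scale, the raw payload implied by these numbers can be
worked out directly. This is only a back-of-the-envelope sketch: it
assumes 1 KB means 1024 bytes and ignores any database or file system
overhead, which is exactly the part the question asks about.

```python
# Raw yearly payload for the volume quoted above, before any storage overhead.
PAGES_PER_YEAR = 10_000_000
KB = 1024          # assumption: "KB" in the post means 1024 bytes
GIB = 1024 ** 3


def yearly_bytes(pages, page_kb):
    """Total bytes for `pages` pages of `page_kb` kilobytes each."""
    return pages * page_kb * KB


low = yearly_bytes(PAGES_PER_YEAR, 15)
high = yearly_bytes(PAGES_PER_YEAR, 20)
print(f"{low / GIB:.0f} to {high / GIB:.0f} GiB of zipped HTML per year")
```

That comes to roughly 143 to 191 GiB of zipped HTML per year.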


Andreas Jung

Oct 13, 2011, 2:46:38 AM
to mongod...@googlegroups.com
It's perfectly fine to store HTML files in MongoDB as standard
UTF-8 encoded strings.

-aj

--
ZOPYX Limited | zopyx group
Charlottenstr. 37/1 | The full-service network for Zope & Plone
D-72070 Tübingen | Produce & Publish
www.zopyx.com | www.produce-and-publish.com
------------------------------------------------------------------------
E-Publishing, Python, Zope & Plone development, Consulting



Kyle Banker

Oct 13, 2011, 10:28:40 AM
to mongod...@googlegroups.com, li...@zopyx.com
Storing small files in MongoDB can be a good idea as an organizational principle and to simplify queries. There will be a little more space overhead in MongoDB than on the filesystem, but if you have to store the metadata anyway, then this may not be much of a problem.

The best approach is to put together a test case.
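
One possible shape for such a test case is sketched below. Everything
in it is an assumption made for illustration: the synthetic page
generator, the field names (`url`, `html_z`), and the helper names are
hypothetical. `overhead_ratio` expects a pymongo `Database` object and
relies on the `collstats` command, which newer server versions may
deprecate in favor of other statistics interfaces.

```python
import random
import string
import zlib


def make_page(n_words=4000, seed=0):
    """Return a synthetic HTML page compressed with zlib (hypothetical test data)."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        length = rng.randint(3, 10)
        words.append("".join(rng.choices(string.ascii_lowercase, k=length)))
    html = "<html><body><p>" + " ".join(words) + "</p></body></html>"
    return zlib.compress(html.encode("utf-8"))


def make_doc(i):
    """Build one document: a little metadata plus the compressed page as bytes."""
    return {"url": f"http://example.com/page/{i}", "html_z": make_page(seed=i)}


def overhead_ratio(db, coll_name, docs):
    """Insert docs, then compare the reported storage size with the raw payload."""
    db[coll_name].insert_many(docs)
    stats = db.command("collstats", coll_name)
    raw = sum(len(d["html_z"]) for d in docs)
    return stats["storageSize"] / raw
```

With a running server, this could be called as
`overhead_ratio(MongoClient()["test"], "pages", [make_doc(i) for i in range(10000)])`
and compared with the disk usage of the same payloads written as plain
files. Adjust `n_words` until the compressed pages match the real size
range.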

Gavin Hogan

Oct 13, 2011, 11:44:04 AM
to mongodb-user
If you are likely to use the HTML data at the same time as the
metadata, then storing it together within a single document/collection
makes sense. If the HTML is not likely to be read as often as the
metadata you are collecting, then you should not store them together.

Kyle, correct me if I am wrong about this, but my understanding is that
Mongo must load a record into memory to let you read it, and that it
will load the full document into memory even if you are only reading a
small portion of it. This behavior is totally acceptable to me, but it
is important to understand how it impacts your document design: you
do not want a collection made of documents where the majority of the
memory footprint is data you have no reason to load, because this
will drastically affect your working set.
HTH
Gavin
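
The split described above could look roughly like the sketch below. It
is an illustration under assumptions, not a recommended schema: the
collection names (`pages_meta`, `pages_body`), the field names, and the
use of the URL as `_id` are all hypothetical. `db` is expected to be a
pymongo `Database`; in Python 3, pymongo stores `bytes` values as BSON
binary data and returns them as `bytes`.

```python
import zlib


def split_page(url, title, html):
    """Build a small metadata document and a separate body document sharing one _id."""
    page_id = url  # assumption: the URL is unique enough to act as the key
    meta = {"_id": page_id, "title": title, "html_len": len(html)}
    body = {"_id": page_id, "html_z": zlib.compress(html.encode("utf-8"))}
    return meta, body


def store(db, meta, body):
    """Write metadata and body into two different collections."""
    db["pages_meta"].insert_one(meta)
    db["pages_body"].insert_one(body)


def load_html(db, page_id):
    """Fetch and decompress the page body only when it is actually needed."""
    body = db["pages_body"].find_one({"_id": page_id})
    return zlib.decompress(body["html_z"]).decode("utf-8")
```

Queries that only need metadata then touch `pages_meta` alone, so the
compressed HTML stays out of the working set until `load_html` is called.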

Kyle Banker

Oct 13, 2011, 11:54:14 AM
to mongodb-user
Gavin,

There's definitely some truth in what you say. OS page size is usually
4 KB, so a 15 KB document will span 4 - 5 pages, depending on how it's
aligned. If you're grabbing only the metadata, and that data is both
small and located at the beginning of the document, then it's possible
that only a single page will have to be loaded. That may nevertheless
be excessive if the metadata is often used on its own.

Kyle
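
The page arithmetic above can be checked with a few lines. This sketch
assumes 4096-byte pages and treats a document as a contiguous byte
range at some offset in a memory-mapped file; the function name and the
example offsets are made up for illustration.

```python
PAGE_SIZE = 4096  # assumption: 4 KB OS pages, as in the post


def pages_touched(offset, length, page_size=PAGE_SIZE):
    """Number of OS pages overlapped by the byte range [offset, offset + length)."""
    if length <= 0:
        return 0
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return last - first + 1


doc_size = 15 * 1024
print(pages_touched(0, doc_size))      # document starts on a page boundary
print(pages_touched(4000, doc_size))   # document starts near the end of a page
print(pages_touched(0, 200))           # small metadata at the start of the document
```

A page-aligned 15 KB document covers 4 pages and a badly aligned one
covers 5, while a 200-byte metadata prefix that does not cross a page
boundary needs just 1.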