Any experiences with GridFS limits?


Manu

Dec 17, 2010, 6:36:35 AM12/17/10
to mongodb-user, man...@miwim.fr
Hi everyone,

Well, we (the company where I work) are currently using MongoDB for
several purposes... and are very happy with it. One of these purposes
is using MongoDB as CDN storage for product images on a dual Intel
Xeon 2.6 GHz box with 12 GB of RAM. A single PHP script queries the grid
(currently containing more than 400,000 documents) and delivers the
stored image through Apache.
If the image isn't available in the grid, the PHP script connects to
another DB, finds the original URL, and does all the processing
needed to finally store the processed image in the grid.

All this works fine... it only lags a little when heavy writes and
image processing happen on the server. My question: does anyone
have similar experience with this kind of GridFS usage? We are
planning to move another project which will need to store about 12
million images in a grid. Will this be technically possible with only
one machine... or will this require sharding?

Manu

Eliot Horowitz

Dec 17, 2010, 10:10:22 AM12/17/10
to mongod...@googlegroups.com
Some questions:
- what is the average image size?
- how many queries/second do you need to handle?


Maciej Dziardziel

Dec 17, 2010, 6:46:07 PM12/17/10
to mongodb-user
Important questions:

1. What is the size of the working set (the set of images that should be
kept in RAM) compared to your RAM?
If it is larger than your RAM, that may limit the speed at which files
are served.

2. Combining Apache and PHP might be a limiting factor. Take a look
at nginx and the nginx-gridfs module.
That way you could achieve more requests/s by using a more efficient
web server and eliminating PHP.

--
Maciej Dziardziel

Manu

Dec 19, 2010, 5:30:16 AM12/19/10
to mongodb-user
Eliot, Maciej, thanks for your support.

The average image size varies quite a bit, but never exceeds 300 KB
(images are stored in different sizes, a 500 px square at most).
The box should serve around 30-40 req/s at peak... (so around 80-120
queries/s for MongoDB).

Maciej, I've looked at nginx and its GridFS module. It looks very
fast and promising... but it doesn't include PHP (we need that to do the
image processing... just once).

About the working set size... I really don't know. Images are requested
randomly by visitors (around 40k per day). Speed isn't the main factor;
it's okay if the image is served within 500 ms.

I've done some calculations... the grid should potentially hold
around 1.5-2 TB of files (for instance, 12 million images at an average
of ~150 KB each already comes to roughly 1.8 TB).

More information: we use this system for one of our price-comparison
websites. We analyze a bunch of different catalogs in order to extract
product titles, descriptions, prices... and image URLs (each product
has an image).

The old way: once a catalog is processed, another script downloads
and resizes the product images. The images are stored as files in folders
(the path is derived from a simple algorithm and corresponds to a
given product id). The processing part (download-resize-store) is long
and painful... Some catalogs contain more than 200k products. File
deletion and updates are also painful: you need a script, the product
DB, and the algorithm just to get back to the product id. This is currently
used on a price-comparison site (12 million products, 1.1 TB of images).

The GridFS way: no more image processing script; a single box
contains the grid, Apache and PHP. When a product page is
hit, the visitor's browser requests an image (<img src="http://cdn.local/
p_1233.jpg?500" ...) which points to a single PHP script (located on another
server... the CDN server). The script queries the local grid with the
passed product_id... if the image is found, it is served; if
not, the single image is downloaded and processed... and finally
stored so it can be served later.
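Roughly, the flow looks like this minimal sketch (a simplified
illustration, not our actual code; it assumes the legacy PECL "mongo"
driver, a DB named "cdn", metadata.product_id / metadata.size fields and
a hypothetical download_and_resize() helper):

<?php
// Minimal sketch of the serve-or-fetch flow (assumed names throughout).
$m    = new Mongo();                        // local mongod
$grid = $m->selectDB('cdn')->getGridFS();   // default fs.files / fs.chunks

$productId = (int) $_GET['id'];             // e.g. p_1233.jpg -> 1233
$size      = isset($_GET['size']) ? (int) $_GET['size'] : 500;

// Look the image up by its metadata.
$query = array('metadata.product_id' => $productId,
               'metadata.size'       => $size);
$image = $grid->findOne($query);

if ($image === null) {
    // Not in the grid yet: fetch the original, resize it, store it.
    $bytes = download_and_resize($productId, $size);   // hypothetical helper
    $grid->storeBytes($bytes, array(
        'filename' => "p_{$productId}_{$size}.jpg",
        'metadata' => array('product_id' => $productId, 'size' => $size),
    ));
    $image = $grid->findOne($query);
}

header('Content-Type: image/jpeg');
echo $image->getBytes();
?>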

So... why is this so cool?

- it's install and forget... we no longer have to worry about the
image import process.
- it's easily maintainable. For example... if we no longer need images
from catalog xxx, a single query deletes all images with
catalog_id xxx (see the sketch after this list).
- image updates become easy... you just delete the file in the
grid... the next image call triggers the download-process-store
cycle again...
- it's self-recovering... if for any reason the CDN server is lost, we
just start another one... which is refilled by the visitors' image
calls...
- we are able to store hit counts... (but we don't do it at this
time...)
- if we need a new image size, we just "authorize" it in the PHP
script and pass the new size in the URL request (example:
http://cdn.local/p_1233.jpg?600)
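
The catalog deletion mentioned above can be a one-liner. A minimal
sketch, again assuming the legacy PECL mongo driver and an assumed
metadata.catalog_id field:

<?php
// Drop every image stored for one catalog; MongoGridFS::remove()
// deletes the matching file documents and their chunks.
$m    = new Mongo();
$grid = $m->selectDB('cdn')->getGridFS();
$grid->remove(array('metadata.catalog_id' => 42));   // 42 = example catalog id
?>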

So... we do this on a small price-comparison site: 400,000 images. It
works well. Can I hope to implement the same schema with one (bigger)
box... on a DB which will potentially hold 12 million images in the
grid (and maybe more)?

Eliot Horowitz

Dec 19, 2010, 9:23:04 AM12/19/10
to mongod...@googlegroups.com
That's a fairly typical setup that should work very well.
Just make sure to put enough capacity in the system so it's not
under-powered to start.

Manuel Eidenberger

Jan 18, 2011, 5:07:53 AM1/18/11
to mongodb-user
Well, back again... I just wanted to give some feedback about this
setup.

We've installed our GridFS server with Apache and PHP on top and it
really works very well. We now have more than one million images stored
on the box and it's still growing (the server is a small one: 2 GB of RAM,
Core 2 Duo 1.8 GHz and a 10 TB RAID 0 array). The server load stays low,
the index size is fine (around 230 MB), the write lock percentage is
around 20%, and content is delivered very quickly (around 50 ms from
server to browser).

Final word: this really works and is in production.

thanks again to all who helped,

Mongodb rocks...


Andreas Jung

Jan 18, 2011, 5:11:58 AM1/18/11
to mongod...@googlegroups.com

Manuel Eidenberger wrote:
> Well, back again... i just wanted to give some feedback about this
> setup.
>
> We've installed our gridfs server with apache-php on top

How did you integrate GridFS with Apache?

-aj


Manuel Eidenberger

Jan 18, 2011, 6:07:39 AM1/18/11
to mongodb-user
Hi Andreas,

PHP and Kristina's driver are used to interact with MongoDB. The image
is output to the browser with a simple echo $image->getBytes();

We also have a simple flushing system: the image is dropped and re-
downloaded by passing an extra argument in the image URL.
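
For the flush, a minimal sketch (assuming a hypothetical ?flush=1
argument and the same assumed metadata fields as in my earlier sketch):

<?php
// If the flush argument is present, drop the stored file; the next
// request will download, process and store the image again.
if (isset($_GET['flush'])) {
    $m    = new Mongo();
    $grid = $m->selectDB('cdn')->getGridFS();
    $grid->remove(array('metadata.product_id' => (int) $_GET['id']));
}
?>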

Manu


Andreas Jung

Jan 18, 2011, 6:14:02 AM1/18/11
to mongod...@googlegroups.com

Manuel Eidenberger wrote:
> Hi Andreas,
>
> php and kristina's driver are used to interact with mongodb. Image is
> outputed to browser with a simple echo $image->getBytes();
>
>

Ah...PHP...ic :-)

I've written zopyx_gridfs for the same purpose :)

http://pypi.python.org/pypi/zopyx_gridfs

-aj
