Using CouchDB as a file system server side


Etay Haun

Jun 21, 2016, 3:44:19 PM
to us...@couchdb.apache.org
Hi,
Thanks for your answers to my last post. They were very helpful.

We are developing a distributed file system solution and we would like to
base our solution on CouchDB.
We would like to use CouchDB to store the files as attachments (each
document will include the file and the file meta-data).
We have a few data centers that store *different* file systems, although
some of the documents are replicated to other data centers.
We have a few questions regarding possible technical issues.
As mentioned, part of our possible solution involves using attachments to
store the actual files in CouchDB.
1. We couldn't find any information regarding suggested attachment size.
2. Is there an issue with storing large attachments? (Up to 2 GB per file,
although most files will be much smaller: a few KB or MB.)
3. We need to replicate some documents, including their attachments,
between CouchDB instances. Is this okay?
4. Does CouchDB also store revisions of attachments?
5. If so, how can we determine the required storage space for an instance,
assuming we know the size of the entire system?
Our biggest instance will include 20TB of attachments.
6. Are there any possible issues with running the instances on Windows 2012
servers?
Thank you in advance.

Alexander Harm

Jun 21, 2016, 4:29:33 PM
to us...@couchdb.apache.org
Hello Etay,

npm did that at one point, and they have a couple of articles on their blog that might be of interest to you:

http://blog.npmjs.org/post/71267056460/fastly-manta-loggly-and-couchdb-attachments
http://blog.npmjs.org/post/75707294465/new-npm-registry-architecture

They experienced problems with storing a lot of attachments in CouchDB and moved to another solution. Also note point 4 of this post by Nolan Lawson:

https://pouchdb.com/2014/06/17/12-pro-tips-for-better-code-with-pouchdb.html

I especially love this quote from Laurie Voss:

"One of the big things that everybody who's spent a lot of time with databases knows is that you should never put your binaries in the database. It's a terrible idea. It always goes wrong. I have never met a database in 15 years of which it is not true, and it's definitely not true of CouchDB.
You are taking this thing which is meant to sort and organize data, and you're giving it binary data, which it can neither sort nor organize. It can't do anything with that data, other than get really fat."

My advice: DON’T.

Regards, Alexander

Brad Rhoads

Jun 21, 2016, 4:48:09 PM
to us...@couchdb.apache.org
I'll second that. It didn't work out well for us. It's probably OK for
small, plain text documents. But it didn't work too well with large media
files.


---------------------------
www.maf.org/rhoads
www.ontherhoads.org

Kevin Coombes

Jun 21, 2016, 5:11:07 PM
to us...@couchdb.apache.org
Have you thought about a two-part solution? You can use Couch for the
front end to store the metadata (making it searchable in lots of
interesting ways) with a separate data store behind it. Along with the
metadata, each CouchDB document would hold a URI that points to the
actual file somewhere else. You can even mix and match back ends,
including straight HTTP or FTP servers as well as Subversion or Git. (We
started implementing this idea to store various kinds of
genomics/genetics/transcriptomics data before I left M.D. Anderson a
couple of years ago. We got far enough to know that it is at least
somewhat more than just theoretically possible. It never got finished,
however, since after I left there was no one to push hard for it.)
Kevin
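
A minimal sketch of the metadata-plus-URI idea described above, assuming a
document layout that is purely illustrative: the field names ("type",
"sha256", "uri"), the "file:" ID prefix, and the example.org URL are all
assumptions, not anything prescribed by CouchDB or by the project mentioned
in this post. The document carries only searchable metadata and a pointer;
the bytes live in whatever back end the URI names.

```python
import hashlib
import json
import mimetypes
from pathlib import PurePosixPath


def build_file_doc(logical_path: str, storage_uri: str, content: bytes) -> dict:
    """Build a CouchDB document describing a file that is stored elsewhere.

    All field names here are hypothetical; only "_id" has a meaning to CouchDB.
    """
    content_type, _ = mimetypes.guess_type(logical_path)
    return {
        # Slashes in an ID must be URL-encoded when the document is requested.
        "_id": "file:" + logical_path,
        "type": "file",
        "name": PurePosixPath(logical_path).name,
        "content_type": content_type or "application/octet-stream",
        "length": len(content),
        # A checksum lets a client verify what it fetched from the back end.
        "sha256": hashlib.sha256(content).hexdigest(),
        # Pointer to the actual bytes (HTTP, FTP, Git, ...); no _attachments.
        "uri": storage_uri,
    }


doc = build_file_doc(
    "/genomics/run42/reads.txt",
    "https://files.example.org/store/run42/reads.txt",
    b"ACGT\n",
)
print(json.dumps(doc, indent=2))
```

Because such documents stay small, replicating them between instances costs
little; moving the files themselves becomes a separate problem for the back
end.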

Michael Zedeler

Jun 21, 2016, 5:31:19 PM
to us...@couchdb.apache.org

On 2016-06-21 22:29, Alexander Harm wrote:
> My advice: DON’T.
I agree with this. We tried using CouchDB for image storage because the
images logically "belonged" to some data entities that were in Couch.
The benefit was that if you deleted an entity, the images would be
deleted too, but the drawbacks were far greater than the benefits. The
most important one was that the database grew so large that syncing it
took so long that it wasn't practical any more. The CouchDB sync
protocol just isn't geared towards moving a very large number of
binaries quickly.

To illustrate how unwieldy it can become, imagine replacing the CouchDB
component with a tar.gz-server that can read and write to a single
tar.gz-archive. You really don't gain anything apart from having
bottlenecks in odd places.

Regards,

Michael.

--
Michael Zedeler
70 25 19 99
mic...@zedeler.dk

dk.linkedin.com/in/mzedeler | twitter.com/mzedeler | github.com/mzedeler

Mike Marino

Jun 21, 2016, 5:31:53 PM
to us...@couchdb.apache.org
Large files in the DB didn't work well for us, either.

We ended up putting together a solution where we stored large files on the
file server and simply associated them with documents in different
databases. We used nginx up front to forward the normal CouchDB attachment
requests to our server handling the files so that the user could use the
normal Couch endpoints. It works well for us (we generally have files up
to ~10 GB), and if you'd like to have a look at how to do it, here's a link
to some docs:

http://nedm-tum.github.io/FileServer-Docker/
https://github.com/nEDM-TUM/FileServer-Docker

Cheers,
Mike
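
The forwarding step can be pictured as a mapping from the standard CouchDB
attachment URL (/db/docid/attname) to a location on the file server. The
sketch below is not the implementation from the linked project; FILE_ROOT,
the function name, and the directory layout are assumptions made only to
illustrate the idea.

```python
from typing import Optional
from urllib.parse import unquote, urlsplit

# Hypothetical directory on the file server; not taken from the linked project.
FILE_ROOT = "/srv/files"


def attachment_to_file_path(url: str) -> Optional[str]:
    """Map a CouchDB attachment URL (/db/docid/attname) to a file path.

    Returns None for anything that is not a plain attachment request, so a
    proxy could pass those requests through to CouchDB unchanged.
    """
    segments = [unquote(s) for s in urlsplit(url).path.split("/") if s]
    if len(segments) < 3:
        return None  # database or document request, not an attachment
    db, doc_id = segments[0], segments[1]
    if db.startswith("_") or doc_id.startswith("_"):
        return None  # special endpoints such as /_utils or /db/_design/...
    if any(s in (".", "..") or "/" in s for s in segments):
        return None  # refuse anything that could escape FILE_ROOT
    return "/".join([FILE_ROOT] + segments)


print(attachment_to_file_path("http://localhost:5984/measurements/run42/raw.bin"))
print(attachment_to_file_path("http://localhost:5984/measurements/run42"))
```

The first call yields a path under FILE_ROOT; the second is an ordinary
document request and yields None, meaning it would go to CouchDB as usual.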


Jason Smith

Jun 22, 2016, 4:20:54 AM
to us...@couchdb.apache.org
On Wed, Jun 22, 2016 at 3:29 AM, Alexander Harm <con...@aharm.de> wrote:

> I especially love the quote of Laurie Voss:
>
> "One of the big things that everybody who's spent a lot of time with
> databases knows is that you should never put your binaries in the database.
> It's a terrible idea. It always goes wrong. I have never met a database in
> 15 years of which it is not true, and it's definitely not true of CouchDB.
> You are taking this thing which is meant to sort and organize data, and
> you're giving it binary data, which it can neither sort nor organize. It
> can't do anything with that data, other than get really fat."
>

This is a pet peeve of mine.

"Do not store files in the database."

That simplistic mantra is up there with "[about regular expressions], now
you have two problems." It is so simple, so naïve. It can be over-done, or
under-analyzed.

Some thoughts.

1. All large systems experience growing pains from off-the-shelf tools. All
large projects must become custom. Twitter's growth is not an indictment of
Ruby on Rails. Google's growth is not an indictment of the ext filesystem.

2. Attachments in CouchDB are very simple. Simple is easy to ship. Simple
is easy to maintain. Simple projects allow you to focus on other problems.

3. Attachments in CouchDB make less sense as a project matures. Yes, I said
it. But just be careful not to prematurely optimize.

Once you really grow, you will be investigating CDNs, and caching tools,
and all sorts of alternatives. But how did you become that successful? How
did you come to this "good problem"? Because you shipped, and you
delivered, and you scaled. And at long last, you have earned the privilege
of building something large and complex.

Laurie Voss joined npm and made that quote some years after npm came
online. It had been growing exponentially for a few years. I remember
clearly: npm was already processing 1,000 requests per second, all on
plain-Jane CouchDB, with a very simple design.

"You didn't build that," as the man says. In my opinion, attachments played
a part in npm's early success, and moving away from them played a part in
npm's later success. Both were wise decisions.

Jason Smith

Jun 22, 2016, 4:27:51 AM
to us...@couchdb.apache.org
And a quick answer to the OP:

Right, if you plan to store 20TB of attachments, and *especially* if you
will replicate entire databases around, then I expect that CouchDB
attachments will be inappropriate.

Of course, now you must take responsibility for replicating 20TB around
your site, potentially tracking revisions and deletions. That will add
complexity to your project. But if you are undertaking a distributed
filesystem, I imagine that you planned for a bit of complexity :)


Aurélien Bénel

Jun 22, 2016, 4:56:40 AM
to us...@couchdb.apache.org
Hi folks,

> *especially* if you will replicate entire databases around

Jason, you mentioned the important criterion: the replication filtering policy.

As an example, we’re very pleased with attachment replication in settings where documents are personal or shared with a few other people.
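
One way to replicate only some documents is to pass an explicit list of IDs to CouchDB's `_replicate` endpoint through its `doc_ids` field; attachments travel with the documents they belong to. The sketch below only builds the request and does not send it. The host names, database names, and document ID are placeholders, and the helper name is hypothetical.

```python
import json
import urllib.request


def replication_request(couch_url, source, target, doc_ids):
    """Build (but do not send) a POST /_replicate request limited to doc_ids."""
    body = {"source": source, "target": target, "doc_ids": list(doc_ids)}
    return urllib.request.Request(
        couch_url.rstrip("/") + "/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = replication_request(
    "http://localhost:5984",              # placeholder local instance
    "files",                              # placeholder source database
    "http://dc2.example.org:5984/files",  # placeholder remote target
    ["file:/shared/report.pdf"],          # placeholder document ID
)
print(req.full_url)
print(req.data.decode("utf-8"))
# To actually start the replication: urllib.request.urlopen(req)
```

A filter function named in the `filter` field is the alternative when the set of documents is defined by a rule rather than by a fixed list.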

Etay, you should evaluate the number and the size of replicated documents (per day?) based on the estimated number of users and of created documents, the estimated share factor, and the average attachment size.
If it seems tractable, just begin with the easy solution; if not, find another.
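
A back-of-the-envelope version of that estimate, as a sketch: it assumes every created document carries one attachment of average size and that "share factor" means the average number of other instances each document is replicated to. All the numbers in the example are invented for illustration.

```python
def daily_replication_load(users, docs_per_user_per_day, share_factor,
                           avg_attachment_bytes):
    """Return (document transfers per day, attachment bytes per day).

    share_factor is taken to be the average number of other instances
    that each new document is replicated to (an assumption).
    """
    created = users * docs_per_user_per_day
    transfers = created * share_factor
    return transfers, transfers * avg_attachment_bytes


# Invented figures: 500 users, 20 documents each per day, every document
# shared with 3 other instances, 2 MB average attachment.
transfers, total_bytes = daily_replication_load(500, 20, 3, 2_000_000)
print(f"{transfers} transfers, {total_bytes / 1e9:.1f} GB per day")
# prints "30000 transfers, 60.0 GB per day"
```

Plugging in realistic figures for each data center gives a first indication of whether attachment replication is tractable or whether another store is needed.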


Regards,

Aurélien

Jan Lehnardt

Jun 22, 2016, 5:21:13 AM
to us...@couchdb.apache.org
I second Jason’s observation. Attachment support can be crucial in getting
something off the ground and proving an idea. That will then usually unlock
the resources to scale things up as needed.

As a rule of thumb:

Don’t ever trust one-line truisms in technology (including this one).

Best
Jan
--
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/

Mike

Jun 22, 2016, 5:39:19 AM
to us...@couchdb.apache.org
Thirded - we have been using Couch with attachments and replication for
over 5 years with no problems. Our use case is much simpler, with about
10% of documents in Couch having attachments (mostly PDFs under 1 MB).

Anything else would have taken us longer, with a lot more moving parts, and
a master/master filesystem is not as simple to set up and maintain as Couch
replication.

Mike.