> On 10 Oct 2016, at 14:59, Bogdan Andu <bog...@gmail.com> wrote:
>
> yes, I know, but the CouchDB storage engine cannot optimize
> this while operating normally; the database is only optimized
> after compaction has finished.
>
> I presume that the entire btree is traversed to detect revisions and unused
> btree nodes.
>
> I have no revisions on documents.
>
> My case clearly leans toward the unused nodes.
>
> Couldn't those nodes be detected in a timely manner while inserting
> (appending to the end of the file) documents, and be deleted
> automatically?
we could do that, but then we’d open ourselves up to database corruption
during power, hardware, or software failures. There are sophisticated
techniques to safeguard against that, but they come with their own set
of trade-offs, one of which is code complexity. Other databases have
millions of lines of code in just this area, and CouchDB is <100kLoC total.
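
To make the trade-off concrete, here is a minimal sketch of the general
append-only commit pattern (illustrative Python, not CouchDB's actual
Erlang internals; append_commit is a made-up name):

    import os

    def append_commit(f, new_nodes: bytes, new_header: bytes) -> None:
        # Append the new btree nodes; nothing already on disk is touched.
        f.seek(0, os.SEEK_END)
        f.write(new_nodes)
        f.flush()
        os.fsync(f.fileno())  # nodes are durable before anything points at them
        # Only now write the header that makes the new tree visible. A crash
        # at any earlier point leaves the previous header, and with it a
        # fully consistent tree, as the last valid state in the file.
        f.write(new_header)
        f.flush()
        os.fsync(f.fileno())

Reclaiming dead nodes on the fly would mean overwriting live regions of
that same file in place; a crash mid-overwrite can then corrupt the only
copy of the tree. Compaction sidesteps this by writing a fresh file and
swapping it in once it is complete.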
> But I assume that the btree must be traversed every time an insert is
> done (or maybe traversed from just a few nodes above the last 100 or
> 1000 new documents).
Yes, for individual docs it is traversed each time; for bulk doc requests
with somewhat sequential doc ids, it is roughly once per bulk request.
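
A worked example (the numbers are illustrative only, assuming a btree of
depth 3): writing 1000 docs one at a time rewrites a 3-node root-to-leaf
path 1000 times, i.e. ~3000 appended nodes, while the same 1000 docs with
sequential ids in a single _bulk_docs request largely share their paths,
so the cost is closer to the number of leaves touched plus a handful of
interior nodes.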
> Now the problem is: why and how do those nodes become unused?
>
> What are the conditions under which the db produces dead nodes?
As soon as a document (or a set of docs in a bulk request) is written,
we append fresh copies of the btree nodes up that particular branch of
the tree and stop referencing the existing ones.
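
Conceptually (a simplified in-memory sketch in Python, not our Erlang
code; Node and rewrite_path are invented for illustration):

    from dataclasses import dataclass, replace

    @dataclass(frozen=True)
    class Node:
        keys: tuple
        children: tuple = ()  # empty for leaf nodes

    def rewrite_path(node: Node, path: tuple, new_leaf: Node) -> Node:
        # Returns a new root: every node on the root-to-leaf path is a
        # fresh copy (appended, in file terms); subtrees off the path
        # are shared unchanged.
        if not path:
            return new_leaf  # the old leaf is now unreferenced
        i = path[0]
        child = rewrite_path(node.children[i], path[1:], new_leaf)
        kids = node.children[:i] + (child,) + node.children[i + 1:]
        return replace(node, children=kids)  # the old interior node is now dead

Every write therefore leaves the old path nodes behind as dead space in
the append-only file, and compaction later copies just the live tree
into a fresh file.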
> If you could manage to avoid this, I think you would have a
> self-compacting database.
>
> Just my 2 cents.
Again, this is a significant engineering effort. E.g., InnoDB does what
you propose, and it took hundreds of millions of dollars and ten years
to get up to speed and reliability. CouchDB does not have these kinds of
resources.
>
> just a side question: wouldn't it be nice to have multiple storage
> engines that follow the same replication protocol, of course?
We are working on this already :)