Error getting security objects when configuring new replica


Carlos Alonso

Jun 13, 2017, 1:34:42 PM
to user
Hi guys!

I continue trying to understand how CouchDB clusters work and trying to
build a compelling administration tool that covers basic operations such as
adding a node to the cluster, moving a shard from one node to another and
so on. It is WIP but already open sourced here:
https://github.com/cabify/couchdb-admin

While testing the scale-out procedure (add a node, make it replicate some
shards, remove the shards from their previous location) I've seen the
following error:

[error] 2017-06-13T15:58:22.299140Z cou...@couch-2.couchdb2-replica-admin
<0.2214.3> -------- Error getting security objects for <<"testdb3">>:
{error,no_majority}


The error mentions not only my testdb3 but also internal databases such as
_global_changes. I was scaling out only testdb3, yet errors appeared for
both testdb3 and _global_changes.


The error appears when I configure a new node as a replica for an
existing shard (by adding it to the by_node and by_range sections of the
document at _dbs/testdb3).
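
To make the shard map edit concrete, here is a minimal sketch in Python. It assumes the layout described in the CouchDB sharding documentation, where the document under _dbs has "by_node", "by_range" and "changelog" fields. The function name add_replica and the sample node names are hypothetical, and the function only transforms a dict; it does not talk to a cluster.

```python
import copy


def add_replica(shard_map, node, shard_range):
    """Return a copy of a shard map document with `node` added as a replica
    of `shard_range`. Hypothetical helper; field names are assumed to be
    "by_node", "by_range" and "changelog"."""
    updated = copy.deepcopy(shard_map)
    changed = False

    # by_range maps each shard range to the nodes that host it.
    nodes = updated.setdefault("by_range", {}).setdefault(shard_range, [])
    if node not in nodes:
        nodes.append(node)
        changed = True

    # by_node maps each node to the shard ranges it hosts.
    ranges = updated.setdefault("by_node", {}).setdefault(node, [])
    if shard_range not in ranges:
        ranges.append(shard_range)
        changed = True

    # Record the change only if something was actually added.
    if changed:
        updated.setdefault("changelog", []).append(["add", shard_range, node])
    return updated


# Made-up sample document: one range hosted by a single node.
shard_map = {
    "_id": "testdb3",
    "changelog": [["add", "00000000-ffffffff", "couchdb@node1.example"]],
    "by_node": {"couchdb@node1.example": ["00000000-ffffffff"]},
    "by_range": {"00000000-ffffffff": ["couchdb@node1.example"]},
}
new_map = add_replica(shard_map, "couchdb@node2.example", "00000000-ffffffff")
```

The updated document would then be written back to _dbs/testdb3 with its current _rev, as with any other document update.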


The error appears every few seconds in the new replica's logs, once for
each of the other replicas (3 for testdb3 and 2 for _global_changes at that
time). It also appears in the other nodes' logs, but just once every few
seconds.


Before configuring the node as a replica I enable the maintenance_mode flag
on it so that it doesn't participate in reads (kudos to Adam Kocoloski for
the advice). The error stops appearing once I remove that flag, which I do
once pending_changes messages stop appearing on the new replica.
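
For reference, a minimal sketch of toggling that flag, assuming the node-local configuration endpoint /_node/{node-name}/_config/couchdb/maintenance_mode from the CouchDB documentation. The helper name maintenance_mode_request, the base URL and the node name are hypothetical. The function only builds the request; sending it (and authenticating as an admin) is left out.

```python
import json
import urllib.request
from urllib.parse import quote


def maintenance_mode_request(base_url, node, enabled):
    """Build (but do not send) a PUT request that sets maintenance_mode on
    one node. Hypothetical helper; the endpoint path is an assumption taken
    from the CouchDB configuration API."""
    url = "{}/_node/{}/_config/couchdb/maintenance_mode".format(
        base_url.rstrip("/"), quote(node, safe="@")
    )
    # Config values are sent as JSON strings, e.g. "true" or "false".
    body = json.dumps("true" if enabled else "false").encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


req = maintenance_mode_request(
    "http://localhost:5984/", "couchdb@node1.example", True
)
```

Passing the request to urllib.request.urlopen would send it; building it separately makes it easy to inspect the URL and body first.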

I think the error prevents the catch-up process from working properly, as
my consistency checks fail when this error appears during the procedure
(it doesn't happen 100% of the time).

I've seen it happen both when the new replica node was completely empty
and when it had the data preloaded (via rsync or because it had
previously been a replica).


I hope all this text helps :)

Thanks!


--
*Carlos Alonso*
Data Engineer
Madrid, Spain

Carlos Alonso

Jun 14, 2017, 12:58:12 PM
to user
Ok, so I've made some progress on this and I'd like to share it here.

So the error says "*Error getting security objects for
<<"affected_database_here">> : {error,no_majority}*". It is actually not
related to configuring a new replica node, as I said before, but to nodes
being in maintenance mode when read/write operations happen.

In summary, having fewer than half of the replica nodes available for the
database you're working on raises this error. The database remains
available though (maximum availability by design I guess :))

My question then is: what exactly does this error mean? What are the
so-called security objects? Is this something one has to take care to
avoid?

Thank you.

Adam Kocoloski

Jun 14, 2017, 1:57:25 PM
to us...@couchdb.apache.org
Hi Carlos,

Ah, this is an interesting edge case. The “security object” contains the “admins” and “members” metadata for a database. For historical reasons it is *not* versioned like a normal document. Under normal operating circumstances every replica of every shard contains a copy of the security object for the database.

When you add a replica for an existing shard that replica does not yet have the security object. There is an internal process running in the cluster that regularly ensures that the security objects for a database are in sync. That process has a safeguard that will cause it to bail out and do nothing unless it recovers a simple majority of the security objects for all shard replicas of the database in question. Your statement that “having less than half of the replica nodes available for the database … raises this error” is almost correct; technically, what causes this error is when the cluster is unable to contact a majority of the *shard replicas*, regardless of which nodes are hosting them.
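
The distinction between a majority of nodes and a majority of shard replicas can be illustrated with a small Python sketch. This is not the cluster's actual implementation: the function has_majority, the strict "more than half" rule, and the sample placement are assumptions made for illustration only.

```python
from typing import Iterable, Set, Tuple

# A shard replica is identified here by (shard range, hosting node).
ShardReplica = Tuple[str, str]


def has_majority(replicas: Iterable[ShardReplica], unreachable: Set[str]) -> bool:
    """Return True if more than half of all shard replicas live on nodes
    that can be contacted. Hypothetical model of the safeguard."""
    all_replicas = list(replicas)
    reachable = [r for r in all_replicas if r[1] not in unreachable]
    return len(reachable) > len(all_replicas) / 2


# Made-up placement: node "a" hosts two shard replicas, "b" and "c" one each.
replicas = [
    ("00000000-7fffffff", "a"),
    ("80000000-ffffffff", "a"),
    ("00000000-7fffffff", "b"),
    ("80000000-ffffffff", "c"),
]

# With "a" in maintenance mode, 2 of 3 nodes are still up, but only
# 2 of 4 shard replicas are reachable, which is not a majority.
a_down = has_majority(replicas, {"a"})
# With "b" in maintenance mode, 3 of 4 shard replicas are reachable.
b_down = has_majority(replicas, {"b"})
```

In this model, which nodes are down matters less than how many shard replicas they host, which matches the point above about counting replicas rather than nodes.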

Hopefully this is an unusual scenario. That said, we could think about improving the cluster’s behavior here by allowing the security synchronization process to “punch through” maintenance mode and retrieve the security objects from those shards for the purposes of establishing a majority and subsequently converging all the shards. I think that’s worth further discussion in a GitHub issue at least.

Cheers, Adam


Carlos Alonso

Jun 19, 2017, 8:02:16 AM
to us...@couchdb.apache.org
Thanks a lot again for your input Adam.

Following up on your comments, I've just opened a GitHub issue with the details:
https://github.com/apache/couchdb/issues/602

Regards