My last email was probably too much of a brain dump, so my apologies for
that. I figured I'd try again with a simpler version of one piece of
the puzzle I'm trying to put together.

In a BigCouch cluster, is it expected behavior that databases across members
of the cluster would be in a permanent state of having differing document
counts, or does that indicate a problem?

To put it another way, if I know for a fact that no activity is occurring on
a particular database, should I expect the document counts across cluster
members to match, or is that not necessarily expected?
On Sun, Oct 28, 2012 at 12:35 PM, Shawn Parrish <sparr...@nodeping.com> wrote:
> I would expect them to match unless internal replication is not happening
> correctly. Our doc counts on our shards are the same.
> Shawn
> On Oct 28, 2012 11:05 AM, "Matthew Woodward" <m...@mattwoodward.com>
> wrote:
>> My last email was probably too much of a brain dump so my apologies for
>> that. I figured I'd try again with a more simple version of one piece of
>> the puzzle I'm trying to put together.
>> In a BigCouch cluster is it expected behavior that databases across
>> members of the cluster would be in a permanent state of having differing
>> document counts, or does that indicate a problem?
>> To put it another way, if I know for a fact no activity is occurring on a
>> particular database should I expect that the document counts across cluster
>> members would match, or is that not necessarily expected?
Check whether your validate_doc_update is perhaps checking for a
specific user context that isn't set during internal replication, as
that's usually the cause of this.
On Sun, Oct 28, 2012 at 3:41 PM, Matthew Woodward <m...@mattwoodward.com> wrote:
> Thanks -- we're only seeing the count discrepancies on a specific DB so I'll
> probably wind up nuking the "bad" shards. Appreciate the info.
Ah-HA! Interesting stuff!
I do have a validate_doc_update in both the databases in question and it's
checking for either the user/password I created for this database or
'admin' as the user context.
Does that need to include another user context for replication to work
properly?
Thanks!
On Mon, Oct 29, 2012 at 10:31 AM, Paul Davis <paul.joseph.da...@gmail.com> wrote:
> Check and see if your validate_doc_update is perhaps checking for a
> specific user context that isn't set during internal replication as
> that's usually the cause of this.
You should short circuit the validation if the user has the "_admin"
role. Once you have that set then you should run an
_all_docs?include_docs=true over the database and it should fix itself
up.
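
For anyone following along, here is a minimal sketch of what such a short circuit might look like. The user name "app_writer" is a made-up placeholder for the account created for the database, and the rest of the validation logic is an assumption, since the actual function was never posted in this thread.

```javascript
// Sketch of a validate_doc_update function with an "_admin" short circuit.
// "app_writer" is a hypothetical user name, not taken from this thread.
function validate_doc_update(newDoc, oldDoc, userCtx, secObj) {
  var allowedUser = "app_writer"; // assumption: the account created for this database

  // Let anything carrying the "_admin" role through untouched. Per the advice
  // above, this is what should let internal replication writes pass.
  if (userCtx.roles.indexOf("_admin") !== -1) {
    return;
  }

  // Otherwise only the dedicated application user may write.
  if (userCtx.name !== allowedUser) {
    throw({ forbidden: "Only " + allowedUser + " or an admin may write here." });
  }
}
```

In a design document the function would be stored as a string under the validate_doc_update field. The key point is that the check looks at userCtx.roles rather than comparing userCtx.name against 'admin'.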
On Mon, Oct 29, 2012 at 1:35 PM, Matthew Woodward <m...@mattwoodward.com> wrote:
> Ah-HA! Interesting stuff!
> I do have a validate_doc_update in both the databases in question and it's
> checking for either the user/password I created for this database or 'admin'
> as the user context.
> Does that need to include another user context for replication to work
> properly?
> Thanks!
On Mon, Oct 29, 2012 at 11:09 AM, Paul Davis <paul.joseph.da...@gmail.com> wrote:
> You should short circuit the validation if the user has the "_admin"
> role. Once you have that set then you should run an
> _all_docs?include_docs=true over the database and it should fix itself
> up.
Excellent. Can't thank you enough for that tip. I'll give it a shot later
today and report back either way so the loop is closed.
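
The read pass suggested above could be scripted roughly as follows. This is only a sketch: the database name is whatever you pass in, the base URL follows the http://hostname:5984 form used elsewhere in this thread, and the helper names are made up for illustration. It relies on the global fetch found in recent Node.js versions unless another fetch function is passed in.

```javascript
// Builds the URL for a full read of a database, documents included.
function allDocsUrl(baseUrl, dbName) {
  // Drop any trailing slashes so the path is not doubled up.
  return baseUrl.replace(/\/+$/, "") + "/" + encodeURIComponent(dbName) +
    "/_all_docs?include_docs=true";
}

// Reads every document once and returns how many rows came back.
// fetchFn is optional; it defaults to the global fetch.
async function readAllDocs(baseUrl, dbName, fetchFn) {
  var doFetch = fetchFn || fetch;
  var response = await doFetch(allDocsUrl(baseUrl, dbName));
  if (!response.ok) {
    throw new Error("Request failed with status " + response.status);
  }
  var body = await response.json();
  return body.rows.length;
}
```

Something like readAllDocs("http://hostname:5984", "mydb") would then walk the whole database once ("mydb" being a placeholder name).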
On Mon, Oct 29, 2012 at 11:11 AM, Matthew Woodward <m...@mattwoodward.com> wrote:
> Excellent. Can't thank you enough for that tip. I'll give it a shot later
> today and report back either way so the loop is closed.
Well, that doesn't seem to have fixed it -- even on a node that I know has
shards on it I can watch the document count vacillate (which I guess means
a node doesn't always use its own shards?), and the count across the nodes
still changes on practically every refresh of Futon.

What I might do next is try to figure out which node's shards have the
highest document count (which should be "correct" given the nature of this
database), copy those shards to the other two servers that have shards for
this database, and see how things go from there.

Appreciate the tip on the _admin user -- great info regardless of whether
or not it fixed my immediate issue.
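
As a rough illustration of the "find the highest count" step, the sketch below reads doc_count from the database info document on each node and reports which nodes are behind. The helper names are hypothetical, and whether you point it at each node's clustered interface or at the individual shard databases is left open here, since the thread does not settle that.

```javascript
// Given { nodeName: docCount }, report the highest count and which nodes lag.
function summarizeDocCounts(countsByNode) {
  var nodes = Object.keys(countsByNode);
  var counts = nodes.map(function (n) { return countsByNode[n]; });
  var highest = Math.max.apply(null, counts);
  return {
    highest: highest,
    leaders: nodes.filter(function (n) { return countsByNode[n] === highest; }),
    behind: nodes.filter(function (n) { return countsByNode[n] < highest; }),
    consistent: nodes.every(function (n) { return countsByNode[n] === highest; })
  };
}

// Fetches doc_count from GET /{db} on one node.
// fetchFn is optional; it defaults to the global fetch in recent Node.js.
async function fetchDocCount(nodeUrl, dbName, fetchFn) {
  var doFetch = fetchFn || fetch;
  var response = await doFetch(nodeUrl + "/" + encodeURIComponent(dbName));
  if (!response.ok) {
    throw new Error("Request failed with status " + response.status);
  }
  var info = await response.json();
  return info.doc_count;
}
```

Running the summary a few times in a row on an idle database is a cheap way to see whether the counts are merely different or actually changing between reads.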
On Mon, Oct 29, 2012 at 12:59 PM, Matthew Woodward <m...@mattwoodward.com> wrote:
> What I might do next is try to figure out which is the node whose shards
> have the highest document count (which should be "correct" given the nature
> of this database), copy those shards to the other two servers that have
> shards for this database, and see how things go from there.
I think I have a bit more info on the issue, so we'll see if the logic
hangs together for the people who know a lot more about how all this works
behind the scenes than I do.

As I mentioned before, the only databases we were having issues with were
ones where applications were writing to a host name on our load balancer,
as opposed to writing directly to an individual cluster node.

After a lot more digging I discovered that in the vm.args file the user
name was the same (the default of simply bigcouch) on all the nodes, as
opposed to the recommended (required?) practice of each node having its
own full name, e.g. bigco...@server.com

Given that the names were the same in vm.args, and that the only databases
we ever had an issue with are ones where we were writing to multiple nodes
instead of only one, I'm wondering if this is the root cause of the
differing document counts.

I certainly hope that's the case because I'd love to be able to throw
writes at the cluster as a whole instead of at individual nodes, but given
the issues we had when we did that I'm a bit reluctant.

As for how I (hopefully) solved the problem: I picked the "correct" shards
and copied those to the other servers that had shards for the offending
databases, and I added the _admin check to any validate_doc_update
functions. We'll see how that goes. With DC shut down because of the
hurricane I won't know for a few days, but I'm keeping my fingers crossed.

Thanks again to everyone for the input! I'd love to hear whether having the
same name in vm.args could cause the sorts of issues we were seeing, so
that I can spread the write load across our cluster. Any input on that
would be much appreciated.
-- Matthew Woodward
m...@mattwoodward.com
http://blog.mattwoodward.com identi.ca / Twitter: @mpwoodward
You can check how the cluster is connected by hitting
http://hostname:5984/_membership on each node and comparing. The
username isn't really a username; it's an Erlang node name. Assuming
your cluster is connected, those will be unique because they'll
have 'nodename@hostname', which is good enough to differentiate.
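
The comparison across nodes could be automated along these lines. This is a sketch under the assumption that each _membership response is a JSON object with all_nodes and cluster_nodes arrays; the function names and the node names in the usage note are invented for illustration.

```javascript
// Returns a sorted copy so that ordering differences do not matter.
function sortedCopy(list) {
  return list.slice().sort();
}

// Normalizes one _membership response into a comparable string.
function membershipKey(membership) {
  return JSON.stringify([
    sortedCopy(membership.all_nodes),
    sortedCopy(membership.cluster_nodes)
  ]);
}

// responsesByHost maps a host name to the parsed _membership body from it.
// True only if every host reports the same lists and, within each response,
// all_nodes and cluster_nodes contain the same node names.
function membershipAgrees(responsesByHost) {
  var hosts = Object.keys(responsesByHost);
  if (hosts.length === 0) {
    return true;
  }
  var first = membershipKey(responsesByHost[hosts[0]]);
  return hosts.every(function (host) {
    var m = responsesByHost[host];
    var listsMatch = JSON.stringify(sortedCopy(m.all_nodes)) ===
      JSON.stringify(sortedCopy(m.cluster_nodes));
    return membershipKey(m) === first && listsMatch;
  });
}
```

Feeding it the parsed responses from each node (for example with names like bigcouch@node1.example.com, which are placeholders) gives a quick yes/no on whether the nodes see the same cluster.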
On Tue, Oct 30, 2012 at 10:05 AM, Matthew Woodward
<m...@mattwoodward.com> wrote:
> After doing a lot more digging around I discovered that in the vm.args file,
> the user name was the same (the default of simply bigcouch) on all the
> nodes, as opposed to the recommended (required?) practice of having each
> node have its own full name, e.g. bigco...@server.com
On Tue, Oct 30, 2012 at 11:17 AM, Paul Davis <paul.joseph.da...@gmail.com> wrote:
> You can check how the cluster is connected by hitting
> http://hostname:5984/_membership on each node and comparing. The
> username isn't really a username, its an Erlang node name. Assuming
> your cluster is connected then those will be unique because they'll
> have 'nodename@hostname' which is good enough to differentiate.
Interesting -- I did that previously (and double-checked just now to be
sure) and it all looks good (all have user@host), so I guess that means
the vm.args thing wasn't the culprit. Bummer.

Well, we'll see how things hold up after getting the shards realigned, and
maybe I'll set up a dummy database and a script that throws writes at the
cluster to see if I can duplicate the issue.

Thanks so much again for all the info and suggestions.
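
A script of the kind mentioned above might look something like the sketch below. It rotates writes across a list of node URLs by POSTing small JSON documents to the database. The node URLs, database name, and document shape are all placeholders, and fetchFn can be swapped out so the loop can be tried without a live cluster.

```javascript
// Picks the node for the i-th write by simple round-robin.
function pickNode(nodeUrls, i) {
  return nodeUrls[i % nodeUrls.length];
}

// Sends `count` small test documents, rotating through the node URLs.
// Returns the number of writes that did not come back with an OK status.
// fetchFn is optional; it defaults to the global fetch in recent Node.js.
async function throwWrites(nodeUrls, dbName, count, fetchFn) {
  var doFetch = fetchFn || fetch;
  var failures = 0;
  for (var i = 0; i < count; i++) {
    var target = pickNode(nodeUrls, i) + "/" + encodeURIComponent(dbName);
    var response = await doFetch(target, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // Hypothetical document shape, just enough to tell the writes apart.
      body: JSON.stringify({ type: "write_test", seq: i })
    });
    if (!response.ok) {
      failures++;
    }
  }
  return failures;
}
```

After a run, comparing the document counts across the nodes against the number of successful writes would show whether spreading writes across nodes reproduces the discrepancy.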