Cluster down, cannot restart

578 views
Skip to first unread message

Roy Smith

unread,
Apr 20, 2014, 6:31:46 PM4/20/14
to etcd...@googlegroups.com
I've got a cluster of three servers (etc1, etc2, and etc3). Two of them have (apparently) crashed, and I can't restart them. etc2 is up, etc1 and etc3 are down.

When I try to restart either of those, they immediately crash and log this:

[etcd] Apr 20 22:19:39.301 INFO | etcd server [name etc3.songza.com, listen on :4001, advertised url http://127.0.0.1:4001]
[etcd] Apr 20 22:19:39.304 INFO | etc3.songza.com: peer added: 'etc1.songza.com'
[etcd] Apr 20 22:19:39.312 INFO | etc3.songza.com: peer added: 'etc2.songza.com'
[etcd] Apr 20 22:19:39.374 INFO | URLs: /_etcd/machines: / etc3.songza.com (http://etc1.songza.com:7001,http://127.0.0.1:7001,http://etc3.songza.com:7001)
[etcd] Apr 20 22:19:39.383 WARNING | Attempt to join via etc1.songza.com:7001 failed: Error during join version check: Get http://etc1.songza.com:7001/version: dial tcp 10.215.34.80:7001: connection refused
[etcd] Apr 20 22:19:39.384 WARNING | Attempt to join via etc3.songza.com:7001 failed: Error during join version check: Get http://etc3.songza.com:7001/version: dial tcp 10.122.34.52:7001: connection refused
[etcd] Apr 20 22:19:39.384 WARNING | No living peers are found!
[etcd] Apr 20 22:19:39.384 INFO | etc3.songza.com: state changed from 'initialized' to 'follower'.
[etcd] Apr 20 22:19:39.384 INFO | peer server [name etc3.songza.com, listen on :7001, advertised url http://127.0.0.1:7001]
[etcd] Apr 20 22:19:39.610 INFO | etc3.songza.com: term #428062 started.
[etcd] Apr 20 22:19:39.714 INFO | etc3.songza.com: state changed from 'follower' to 'candidate'.
[etcd] Apr 20 22:19:39.715 INFO | etc3.songza.com: state changed from 'candidate' to 'leader'.
[etcd] Apr 20 22:19:39.715 INFO | etc3.songza.com: leader changed from '' to 'etc3.songza.com'.
panic: assertion failed: leader.elected.at.same.term.428063

The conf file in my data directory contains:

{"commitIndex":0,"peers":[{"name":"etc1.songza.com","connectionString":""},{"name":"etc2.songza.com","connectionString":""}]}




--
Roy Smith
r...@panix.com

Brandon Philips

unread,
Apr 20, 2014, 8:11:40 PM4/20/14
to Roy Smith, etcd...@googlegroups.com
Is this using the master version or the 0.3 release?
> --
> You received this message because you are subscribed to the Google Groups "etcd-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to etcd-dev+u...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Roy Smith

unread,
Apr 21, 2014, 12:15:58 AM4/21/14
to Brandon Philips, etcd...@googlegroups.com
I'm running "etcd version 0.3.0+git"

I built this from source. Looks like the most recent change I've got is:

> commit d266f9dec718dc3e6b571f09b973a0a8be5718f1
> Merge: d4b33c7 1af51ca
> Author: Yicheng Qin <qycqycq...@gmail.com>
> Date: Mon Apr 14 11:06:31 2014 -0700
>
> Merge pull request #713 from unihorn/63
>
> fix(server/key): unable to update unexpired ttl




On Apr 20, 2014, at 5:11 PM, Brandon Philips wrote:

> Is this using the master version or the 0.3 release?


--
Roy Smith
r...@panix.com



Yicheng Qin

unread,
Apr 21, 2014, 2:31:38 AM4/21/14
to Roy Smith, Brandon Philips, etcd...@googlegroups.com
Hi Roy,

I think the version is ok. But if you could update it to the latest version, it would be the best.
Do you still have the command to start and restart these nodes? And operations with details?
What's etc2 log when etc3 tries to join the cluster?

The question for now is that the log cannot match with each other now. One possible solution is to restart etc1 and etc3 without old data log, and the cluster will use etc2's log as the cluster log. But this is a dangerous operation, so I would like to know the ideas from other guys.

Yicheng


--
Roy Smith
r...@panix.com



Roy Smith

unread,
Apr 21, 2014, 9:31:42 AM4/21/14
to Yicheng Qin, Brandon Philips, etcd...@googlegroups.com
At this point, we're not running in production.  This is just an experimental setup to evaluate the technology, so I'm fine with blowing away my current data and starting again from scratch. 

I was starting from https://github.com/coreos/etcd/blob/master/Documentation/clustering.md and trying to translate the examples to something that made sense for my environment.  What I ended up with was:

# on etc1:
./bin/etcd \
    -addr etc1.songza.com:4001 \
    -bind-addr 0.0.0.0 \
    -peer-addr etc1.songza.com:7001 \
    -peer-bind-addr 0.0.0.0 \
    -data-dir /home/etcd/data/etc1.songza.com \
    -name etc1.songza.com

# on etc2:
./bin/etcd \
    -addr etc2.songza.com:4001 \
    -bind-addr 0.0.0.0 \
    -peer-addr etc2.songza.com:7001 \
    -peer-bind-addr 0.0.0.0 \
    -data-dir data/etc2.songza.com \
    -name etc2.songza.com

# on etc3:
./bin/etcd \
    -addr etc3.songza.com:4001 \
    -bind-addr 0.0.0.0 \
    -peer-addr etc3.songza.com:7001 \
    -peer-bind-addr 0.0.0.0 \
    -data-dir data/etc3.songza.com \
    -name etc3.songza.com



When I try to restart etc3, I see nothing in the etc2 logs at all.  Which jive with what etc3 is logging:

[etcd] Apr 21 13:11:45.851 WARNING   | Attempt to join via etc1.songza.com:7001 failed: Error during join version check: Get http://etc1.songza.com:7001/version: dial tcp 10.215.34.80:7001: connection refused
[etcd] Apr 21 13:11:45.851 WARNING   | Attempt to join via etc3.songza.com:7001 failed: Error during join version check: Get http://etc3.songza.com:7001/version: dial tcp 10.122.34.52:7001: connection refused

Why is etc3 trying to join with itself (the second log line).  Is that normal?  Actually, I don't see anything being logging on etc1 either.



--
Roy Smith



Yicheng Qin

unread,
Apr 21, 2014, 12:33:06 PM4/21/14
to Roy Smith, Brandon Philips, etcd...@googlegroups.com
The data is still there, but we want to find out where the problem is. :)
If etc3 is disconnected from the cluster, the leader should always say that 'miss heartbeat from etc2', and you should see some warnings, but etc3 should be able to connect back to the cluster.
I have some simulation script here, and it works fine:
```
rm -rf machines

./bin/etcd -peer-addr 127.0.0.1:7001 -addr 127.0.0.1:4001 -data-dir machines/machine1 -name machine1 &
node1=$!
./bin/etcd -peer-addr 127.0.0.1:7002 -addr 127.0.0.1:4002 -peers 127.0.0.1:7001,127.0.0.1:7003 -data-dir machines/machine2 -name machine2 &
node2=$!
./bin/etcd -peer-addr 127.0.0.1:7003 -addr 127.0.0.1:4003 -peers 127.0.0.1:7001,127.0.0.1:7002 -data-dir machines/machine3 -name machine3 &
node3=$!

sleep 10

curl -L http://127.0.0.1:4001/v2/keys/foo -XPUT -d value=bar

kill $node3
echo machine3 is killed

sleep 5

./bin/etcd -peer-addr 127.0.0.1:7003 -addr 127.0.0.1:4003 -peers 127.0.0.1:7001,127.0.0.1:7002 -data-dir machines/machine3 -name machine3 &
node3=$?
echo machine3 is restarted

sleep 10

curl -L http://127.0.0.1:4003/v2/keys/foo

kill `pgrep etcd`
```
Could you try this script, or tell us the difference from this one?
The etcd is built from the latest HEAD.

Roy Smith

unread,
Apr 22, 2014, 10:16:26 PM4/22/14
to Yicheng Qin, Brandon Philips, etcd...@googlegroups.com
Hi,

My apologies for not getting back to you sooner; I've been on the road.

I ran your script, on my etc3 machine.  At the time I started the script, there were no etcd processes running on that machine.  Here's what it printed:

$ sh go2
[etcd] Apr 23 01:57:51.807 INFO      | etcd server [name machine3, listen on :4003, advertised url http://127.0.0.1:4003]
[etcd] Apr 23 01:57:51.809 INFO      | etcd server [name machine2, listen on :4002, advertised url http://127.0.0.1:4002]
[etcd] Apr 23 01:57:51.813 INFO      | etcd server [name machine1, listen on :4001, advertised url http://127.0.0.1:4001]
[etcd] Apr 23 01:57:51.814 INFO      | URLs: /_etcd/machines:  / machine3 ()
[etcd] Apr 23 01:57:51.815 INFO      | URLs: /_etcd/machines:  / machine2 ()
[etcd] Apr 23 01:57:51.816 WARNING   | Attempt to join via 127.0.0.1:7001 failed: Error during join version check: Get http://127.0.0.1:7001/version: dial tcp 127.0.0.1:7001: connection refused
[etcd] Apr 23 01:57:51.817 WARNING   | Attempt to join via 127.0.0.1:7002 failed: Error during join version check: Get http://127.0.0.1:7002/version: dial tcp 127.0.0.1:7002: connection refused
[etcd] Apr 23 01:57:51.817 WARNING   | No living peers are found!
[etcd] Apr 23 01:57:51.817 CRITICAL  | No available peers in backup list, and no log data
[etcd] Apr 23 01:57:51.817 INFO      | URLs: /_etcd/machines:  / machine1 ()
[etcd] Apr 23 01:57:51.819 WARNING   | Attempt to join via 127.0.0.1:7001 failed: Error during join version check: Get http://127.0.0.1:7001/version: dial tcp 127.0.0.1:7001: connection refused
[etcd] Apr 23 01:57:51.819 WARNING   | Attempt to join via 127.0.0.1:7003 failed: Error during join version check: Get http://127.0.0.1:7003/version: dial tcp 127.0.0.1:7003: connection refused
[etcd] Apr 23 01:57:51.819 WARNING   | No living peers are found!
[etcd] Apr 23 01:57:51.819 CRITICAL  | No available peers in backup list, and no log data
[etcd] Apr 23 01:57:51.820 INFO      | This peer is starting a brand new cluster now.
[etcd] Apr 23 01:57:51.820 INFO      | machine1: state changed from 'initialized' to 'follower'.
[etcd] Apr 23 01:57:51.820 INFO      | machine1: state changed from 'follower' to 'leader'.
[etcd] Apr 23 01:57:51.820 INFO      | machine1: leader changed from '' to 'machine1'.
[etcd] Apr 23 01:57:51.830 INFO      | peer server [name machine1, listen on :7001, advertised url http://127.0.0.1:7001]
[etcd] Apr 23 01:57:51.847 INFO      | machine1: state changed from 'leader' to 'follower'.
[etcd] Apr 23 01:57:51.847 INFO      | machine1: term #1046725 started.
[etcd] Apr 23 01:57:51.847 INFO      | machine1: leader changed from 'machine1' to 'etc2.songza.com'.
[etcd] Apr 23 01:57:51.895 INFO      | machine1: state changed from 'follower' to 'snapshotting'.
[etcd] Apr 23 01:57:51.899 INFO      | machine1: peer added: 'etc1.songza.com'
[etcd] Apr 23 01:57:51.904 INFO      | machine1: peer added: 'etc2.songza.com'
[etcd] Apr 23 01:57:51.909 INFO      | machine1: peer added: 'etc3.songza.com'
[etcd] Apr 23 01:57:51.945 INFO      | machine1: state changed from 'snapshotting' to 'follower'.
[etcd] Apr 23 01:57:51.945 INFO      | machine1: term #1046725 started.
[etcd] Apr 23 01:57:51.951 INFO      | machine1: peer removed: 'etc1.songza.com'
[etcd] Apr 23 01:57:54.840 INFO      | machine1: snapshot of 443496 events at index 443496 completed

I don't understand what's happening here.  This new cluster apparently discovered my server running on etc2.songza.com and joined that.  I don't understand how it did that.

Yicheng Qin

unread,
Apr 23, 2014, 10:00:38 AM4/23/14
to Roy Smith, Brandon Philips, etcd...@googlegroups.com
machine2 and machine3 fails to find `-peers` work, so they quit to tell user that they cannot join the cluster as the user wants. This is a bug in early version, and we have just fixed it. Could you fetch the HEAD from github? Sorry for confusion.

machine1 is up and is found by your old cluster. So it is added to the old cluster automatically because the address is used in the old cluster, and currently we haven't developed the way to verify the etcd name. The way to fix this is either close the old etcds/cluster, or start machine1 on port other than 7001 (maybe 7004).

Try it?

Yicheng

Roy Smith

unread,
Apr 23, 2014, 11:07:22 AM4/23/14
to Yicheng Qin, Brandon Philips, etcd...@googlegroups.com
OK, I'm going to rebuild from HEAD and try this all again.  I'll let you know how that goes.

But, in the meantime, I see one another problem (or, maybe it's the bug you're talking about).  When I first read your mail, I couldn't understand how my other cluster could have found this one, since the servers started up by your script specify 127.0.0.1, which should not be reachable from other machines.  But, looking at the netstat output:

# netstat -natp | grep etcd
tcp6       0      0 :::7001                 :::*                    LISTEN      10563/etcd      
tcp6       0      0 :::4001                 :::*                    LISTEN      10563/etcd      

and strace of a server as it starts up:

machine1.trace.10660:bind(3, {sa_family=AF_INET6, sin6_port=htons(4001), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
machine1.trace.10664:bind(6, {sa_family=AF_INET6, sin6_port=htons(7001), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0

I can see that it's binding to the wildcard, not to 127.0.0.1.



---
Roy Smith



Roy Smith

unread,
Apr 23, 2014, 11:47:06 AM4/23/14
to Yicheng Qin, Brandon Philips, etcd...@googlegroups.com
I just did I pull. I'm up to date with:

commit a563d82f95d60c6f3ddf3619ed150e09856daf2f
Merge: 9af9438 c7116a3
Author: Yicheng Qin <qycqycq...@gmail.com>
Date: Tue Apr 22 14:55:17 2014 -0700

Merge pull request #752 from unihorn/75

fix(server): dispatch SetClusterConfigCommand correctly

When I rebuild and run your script with the new build, I get the following. I assume that's what I'm supposed to be seeing? But, I'm still seeing the server bind to the wildcard address, as I mentioned in my previous email.

$ sh go2
[etcd] Apr 23 15:42:20.306 INFO | etcd server [name machine3, listen on :4003, advertised url http://127.0.0.1:4003]
[etcd] Apr 23 15:42:20.306 INFO | etcd server [name machine2, listen on :4002, advertised url http://127.0.0.1:4002]
[etcd] Apr 23 15:42:20.309 INFO | etcd server [name machine1, listen on :4001, advertised url http://127.0.0.1:4001]
[etcd] Apr 23 15:42:20.310 WARNING | Attempt to join via 127.0.0.1:7001 failed: Error during join version check: Get http://127.0.0.1:7001/version: dial tcp 127.0.0.1:7001: connection refused
[etcd] Apr 23 15:42:20.311 WARNING | Attempt to join via 127.0.0.1:7002 failed: Error during join version check: Get http://127.0.0.1:7002/version: dial tcp 127.0.0.1:7002: connection refused
[etcd] Apr 23 15:42:20.311 WARNING | machine3 is unable to join the cluster using any of the peers [127.0.0.1:7001 127.0.0.1:7002] at 0th time. Retrying in 3.4 seconds
[etcd] Apr 23 15:42:20.312 INFO | machine1 is starting a new cluster.
[etcd] Apr 23 15:42:20.313 INFO | machine1: state changed from 'initialized' to 'follower'.
[etcd] Apr 23 15:42:20.313 INFO | machine1: state changed from 'follower' to 'leader'.
[etcd] Apr 23 15:42:20.313 INFO | machine1: leader changed from '' to 'machine1'.
[etcd] Apr 23 15:42:20.314 WARNING | Attempt to join via 127.0.0.1:7001 failed: Error during join version check: Get http://127.0.0.1:7001/version: dial tcp 127.0.0.1:7001: connection refused
[etcd] Apr 23 15:42:20.314 WARNING | Attempt to join via 127.0.0.1:7003 failed: Error during join version check: Get http://127.0.0.1:7003/version: dial tcp 127.0.0.1:7003: connection refused
[etcd] Apr 23 15:42:20.314 WARNING | machine2 is unable to join the cluster using any of the peers [127.0.0.1:7001 127.0.0.1:7003] at 0th time. Retrying in 4.8 seconds
[etcd] Apr 23 15:42:20.324 INFO | peer server [name machine1, listen on :7001, advertised url http://127.0.0.1:7001]
[etcd] Apr 23 15:42:23.312 INFO | Send Join Request to http://127.0.0.1:7001/v2/admin/machines/machine3
[etcd] Apr 23 15:42:23.313 INFO | machine1: peer added: 'machine3'
[etcd] Apr 23 15:42:23.320 INFO | »»»» 200
[etcd] Apr 23 15:42:23.321 INFO | machine3: state changed from 'initialized' to 'follower'.
[etcd] Apr 23 15:42:23.321 INFO | peer server [name machine3, listen on :7003, advertised url http://127.0.0.1:7003]
[etcd] Apr 23 15:42:23.368 INFO | machine3: peer added: 'machine1'
[etcd] Apr 23 15:42:24.315 INFO | Send Join Request to http://127.0.0.1:7001/v2/admin/machines/machine2
[etcd] Apr 23 15:42:24.369 INFO | machine1: peer added: 'machine2'
[etcd] Apr 23 15:42:24.375 INFO | »»»» 200
[etcd] Apr 23 15:42:24.376 INFO | machine2: state changed from 'initialized' to 'follower'.
[etcd] Apr 23 15:42:24.376 INFO | peer server [name machine2, listen on :7002, advertised url http://127.0.0.1:7002]
[etcd] Apr 23 15:42:24.413 INFO | machine3: peer added: 'machine2'
[etcd] Apr 23 15:42:24.425 INFO | machine2: peer added: 'machine1'
[etcd] Apr 23 15:42:24.430 INFO | machine2: peer added: 'machine3'
{"action":"set","node":{"key":"/foo","value":"bar","modifiedIndex":4,"createdIndex":4}}machine3 is killed
[etcd] Apr 23 15:42:30.377 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:30.413 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:30.463 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:30.513 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:30.563 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:30.613 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:30.663 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:30.713 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:30.763 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:30.813 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:30.863 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:30.913 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:30.963 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:31.013 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:31.063 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:31.113 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:31.163 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:31.213 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:31.263 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:31.313 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:31.363 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:31.413 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:31.463 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:31.513 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:31.563 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:31.613 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:31.663 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:31.713 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:31.763 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:31.813 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:31.863 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:31.913 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:31.963 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:32.013 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:32.063 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:32.113 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:32.163 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:32.213 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:32.263 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:32.313 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:32.363 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:32.413 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:32.463 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:32.513 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:32.563 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:32.613 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:32.663 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:32.713 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:32.763 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:32.813 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:32.863 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:32.913 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:32.963 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:33.013 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:33.063 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:33.113 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:33.163 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:33.213 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:33.263 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:33.313 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:33.363 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:33.413 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:33.463 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:33.513 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:33.563 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:33.613 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:33.663 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:33.713 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:33.763 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:33.813 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:33.863 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:33.913 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:33.963 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:34.013 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:34.063 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:34.113 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:34.163 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:34.213 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:34.263 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:34.313 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:34.363 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:34.413 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:34.463 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:34.513 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:34.563 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:34.613 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:34.663 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:34.713 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:34.763 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:34.813 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:34.863 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:34.913 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:34.963 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:35.013 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:35.063 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:35.113 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:35.163 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:35.213 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:35.263 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:35.313 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:35.363 INFO | machine1: warning: heartbeat timed out: 'machine3'
machine3 is restarted
[etcd] Apr 23 15:42:35.381 INFO | etcd server [name machine3, listen on :4003, advertised url http://127.0.0.1:4003]
[etcd] Apr 23 15:42:35.383 INFO | machine3: peer added: 'machine1'
[etcd] Apr 23 15:42:35.395 INFO | machine3: peer added: 'machine2'
[etcd] Apr 23 15:42:35.401 INFO | Peer URLs in log: / machine3 (http://127.0.0.1:7001,http://127.0.0.1:7003,http://127.0.0.1:7002)
[etcd] Apr 23 15:42:35.402 INFO | Send Join Request to http://127.0.0.1:7001/v2/admin/machines/machine3
[etcd] Apr 23 15:42:35.462 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:35.463 INFO | machine1: warning: heartbeat timed out: 'machine3'
[etcd] Apr 23 15:42:35.468 INFO | »»»» 200
[etcd] Apr 23 15:42:35.468 INFO | machine3: state changed from 'initialized' to 'follower'.
[etcd] Apr 23 15:42:35.468 INFO | peer server [name machine3, listen on :7003, advertised url http://127.0.0.1:7003]
{"action":"get","node":{"key":"/foo","value":"bar","modifiedIndex":4,"createdIndex":4}}etcd@etc3:~$

On Apr 23, 2014, at 11:07 AM, Roy Smith wrote:

> OK, I'm going to rebuild from HEAD and try this all again. I'll let you know how that goes.




---
Roy Smith
r...@panix.com



Yicheng Qin

unread,
Apr 23, 2014, 3:40:16 PM4/23/14
to Roy Smith, Brandon Philips, etcd...@googlegroups.com
The script runs well!
What the script does is that it starts 3 etcd, and connect them together. Later it kills one node, and restart it to see whether or not it could be connected back. And it works.
Could you try to replay it in this way?

If the restart still fails, you could kill all etcds, and restart them with the latest version binary using the same flags. Don't use '-f', and etcd should load from the existing logs. And paste the log of each of them. It is really helpful if we could see this log, so we could know what actually happens in the cluster.

I doubt you mess up logs of different clusters together, so it cannot work fine. If you plan to play with them from beginning, remember to remove all data dirs first. :)

Thanks!

Yicheng

Roy Smith

unread,
Apr 23, 2014, 5:58:05 PM4/23/14
to Yicheng Qin, Brandon Philips, etcd...@googlegroups.com
Things are not going well.  I installed the newest etcd binary on all my servers.  I then stopped all the etcd processes, moved the data directories somewhere else, and restarted all three.  Things seemed to start up OK, but when I stopped etc1, I got this on etc3:

[…]
[etcd] Apr 23 21:52:21.731 INFO      | etc3.songza.com: warning: heartbeat timed out: 'etc1.songza.com'
[etcd] Apr 23 21:52:21.781 INFO      | etc3.songza.com: warning: heartbeat timed out: 'etc1.songza.com'
[etcd] Apr 23 21:52:21.832 INFO      | etc3.songza.com: warning: heartbeat timed out: 'etc1.songza.com'
[etcd] Apr 23 21:52:21.881 INFO      | etc3.songza.com: warning: heartbeat timed out: 'etc1.songza.com'
[etcd] Apr 23 21:52:21.894 INFO      | Rejoin with http://127.0.0.1:7001 instead of http://etc1.songza.com:7001 from etc1.songza.com
panic: assertion failed: leader.elected.at.same.term.1


goroutine 12 [running]:
runtime.panic(0x735a80, 0xc21022d9c0)
/home/roy/go/src/pkg/runtime/panic.c:266 +0xb6
github.com/coreos/etcd/third_party/github.com/goraft/raft._assert(0x86fa00, 0x8dd2f0, 0x1f, 0x7f1de5fa0cc8, 0x1, ...)
github.com/coreos/etcd/third_party/github.com/goraft/raft.(*server).processAppendEntriesRequest(0xc2100577e0, 0xc2101a0140, 0xc2101a0140, 0xc2101a0140)
github.com/coreos/etcd/third_party/github.com/goraft/raft.(*server).leaderLoop(0xc2100577e0)

goroutine 1 [IO wait]:
net.runtime_pollWait(0x7f1de6161300, 0x72, 0x0)
/home/roy/go/src/pkg/runtime/netpoll.goc:116 +0x6a
net.(*pollDesc).Wait(0xc210052990, 0x72, 0x7f1de61600e8, 0xb)
/home/roy/go/src/pkg/net/fd_poll_runtime.go:81 +0x34
net.(*pollDesc).WaitRead(0xc210052990, 0xb, 0x7f1de61600e8)
/home/roy/go/src/pkg/net/fd_poll_runtime.go:86 +0x30
net.(*netFD).accept(0xc210052930, 0x949f90, 0x0, 0x7f1de61600e8, 0xb)
/home/roy/go/src/pkg/net/fd_unix.go:382 +0x2c2
net.(*TCPListener).AcceptTCP(0xc210000478, 0x18, 0xc210042010, 0x517f63)
/home/roy/go/src/pkg/net/tcpsock_posix.go:233 +0x47
net.(*TCPListener).Accept(0xc210000478, 0x50, 0xef5d00, 0x18, 0x0)
/home/roy/go/src/pkg/net/tcpsock_posix.go:243 +0x27
net/http.(*Server).Serve(0xc2100b2c80, 0x7f1de6162180, 0xc210000478, 0x0, 0x0)
/home/roy/go/src/pkg/net/http/server.go:1622 +0x91
net/http.Serve(0x7f1de6162180, 0xc210000478, 0x7f1de61623d8, 0xc2100c6860, 0x0, ...)
/home/roy/go/src/pkg/net/http/server.go:1561 +0x70
github.com/coreos/etcd/etcd.(*Etcd).Run(0xc21005b540)
/home/roy/etcd/gopath/src/github.com/coreos/etcd/etcd/etcd.go:219 +0x1d59
main.main()
/home/roy/etcd/gopath/src/github.com/coreos/etcd/main.go:43 +0x2b2

goroutine 4 [syscall]:
os/signal.loop()
/home/roy/go/src/pkg/os/signal/signal_unix.go:21 +0x1e
created by os/signal.init·1
/home/roy/go/src/pkg/os/signal/signal_unix.go:27 +0x31

goroutine 5 [IO wait]:
net.runtime_pollWait(0x7f1de61611b0, 0x72, 0x0)
/home/roy/go/src/pkg/runtime/netpoll.goc:116 +0x6a
net.(*pollDesc).Wait(0xc2100c21b0, 0x72, 0x7f1de61600e8, 0xb)
/home/roy/go/src/pkg/net/fd_poll_runtime.go:81 +0x34
net.(*pollDesc).WaitRead(0xc2100c21b0, 0xb, 0x7f1de61600e8)
/home/roy/go/src/pkg/net/fd_poll_runtime.go:86 +0x30
net.(*netFD).accept(0xc2100c2150, 0x949f90, 0x0, 0x7f1de61600e8, 0xb)
/home/roy/go/src/pkg/net/fd_unix.go:382 +0x2c2
net.(*TCPListener).AcceptTCP(0xc210000c00, 0x51818b, 0x7f1de5fd8dd8, 0x51818b)
/home/roy/go/src/pkg/net/tcpsock_posix.go:233 +0x47
net.(*TCPListener).Accept(0xc210000c00, 0x7f1de6162400, 0xc21013c8e0, 0xc210182800, 0x0)
/home/roy/go/src/pkg/net/tcpsock_posix.go:243 +0x27
net/http.(*Server).Serve(0xc2100b2af0, 0x7f1de6162180, 0xc210000c00, 0x0, 0x0)
/home/roy/go/src/pkg/net/http/server.go:1622 +0x91
net/http.Serve(0x7f1de6162180, 0xc210000c00, 0x7f1de61623d8, 0xc210050e80, 0x0, ...)
/home/roy/go/src/pkg/net/http/server.go:1561 +0x70
/home/roy/etcd/gopath/src/github.com/coreos/etcd/etcd/etcd.go:210 +0x34b
created by github.com/coreos/etcd/etcd.(*Etcd).Run
/home/roy/etcd/gopath/src/github.com/coreos/etcd/etcd/etcd.go:216 +0x1c58

goroutine 27 [select]:
net/http.(*persistConn).roundTrip(0xc210182880, 0xc21022d3c0, 0xc210182880, 0x0, 0x0)
/home/roy/go/src/pkg/net/http/transport.go:879 +0x6d6
net/http.(*Transport).RoundTrip(0xc210062a00, 0xc2101d3820, 0xc21022c3c0, 0x0, 0x0)
/home/roy/go/src/pkg/net/http/transport.go:187 +0x391
github.com/coreos/etcd/third_party/github.com/mreiferson/go-httpclient.(*Transport).RoundTrip(0xc21005b660, 0xc2101d3820, 0x1, 0x0, 0x0)
net/http.send(0xc2101d3820, 0x7f1de6161ec0, 0xc21005b660, 0x0, 0x0, ...)
/home/roy/go/src/pkg/net/http/client.go:168 +0x37f
net/http.(*Client).send(0xc2100a67e0, 0xc2101d3820, 0x20, 0xc21022e6c0, 0xc21022d3b0)
/home/roy/go/src/pkg/net/http/client.go:100 +0xd9
net/http.(*Client).doFollowingRedirects(0xc2100a67e0, 0xc2101d3820, 0x94af68, 0x0, 0x0, ...)
/home/roy/go/src/pkg/net/http/client.go:294 +0x671
net/http.(*Client).Do(0xc2100a67e0, 0xc2101d3820, 0xc21022e690, 0x0, 0x0)
/home/roy/go/src/pkg/net/http/client.go:132 +0x157
github.com/coreos/etcd/server.(*transporter).Post(0xc2100a8100, 0xc21022e690, 0x20, 0x7f1de61627a0, 0xc2101fd3f0, ...)
/home/roy/etcd/gopath/src/github.com/coreos/etcd/server/transporter.go:235 +0x82
github.com/coreos/etcd/server.(*transporter).SendAppendEntriesRequest(0xc2100a8100, 0x7f1de6161f58, 0xc2100577e0, 0xc2100d8690, 0xc2101a00a0, ...)
/home/roy/etcd/gopath/src/github.com/coreos/etcd/server/transporter.go:108 +0x530
github.com/coreos/etcd/third_party/github.com/goraft/raft.(*Peer).sendAppendEntriesRequest(0xc2100d8690, 0xc2101a00a0)
github.com/coreos/etcd/third_party/github.com/goraft/raft.(*Peer).heartbeat(0xc2100d8690, 0xc21011c3c0)
created by github.com/coreos/etcd/third_party/github.com/goraft/raft.(*Peer).startHeartbeat

goroutine 9 [syscall]:
runtime.goexit()
/home/roy/go/src/pkg/runtime/proc.c:1394

goroutine 26 [select]:
net/http.(*persistConn).writeLoop(0xc210114c80)
/home/roy/go/src/pkg/net/http/transport.go:791 +0x271
created by net/http.(*Transport).dialConn
/home/roy/go/src/pkg/net/http/transport.go:529 +0x61e

goroutine 13 [chan receive]:
github.com/coreos/etcd/server.(*PeerServer).monitorSync(0xc21000a0f0)
/home/roy/etcd/gopath/src/github.com/coreos/etcd/server/peer_server.go:718 +0x59
created by github.com/coreos/etcd/server.(*PeerServer).Start
/home/roy/etcd/gopath/src/github.com/coreos/etcd/server/peer_server.go:296 +0x351

goroutine 14 [select]:
github.com/coreos/etcd/server.(*PeerServer).monitorTimeoutThreshold(0xc21000a0f0, 0xc2100b87e0)
/home/roy/etcd/gopath/src/github.com/coreos/etcd/server/peer_server.go:730 +0x1a8
created by github.com/coreos/etcd/server.(*PeerServer).Start
/home/roy/etcd/gopath/src/github.com/coreos/etcd/server/peer_server.go:297 +0x377

goroutine 15 [select]:
github.com/coreos/etcd/server.(*PeerServer).monitorActiveSize(0xc21000a0f0, 0xc2100b87e0)
/home/roy/etcd/gopath/src/github.com/coreos/etcd/server/peer_server.go:745 +0xeb4
created by github.com/coreos/etcd/server.(*PeerServer).Start
/home/roy/etcd/gopath/src/github.com/coreos/etcd/server/peer_server.go:298 +0x39d

goroutine 16 [select]:
github.com/coreos/etcd/server.(*PeerServer).monitorPeerActivity(0xc21000a0f0, 0xc2100b87e0)
/home/roy/etcd/gopath/src/github.com/coreos/etcd/server/peer_server.go:807 +0x632
created by github.com/coreos/etcd/server.(*PeerServer).Start
/home/roy/etcd/gopath/src/github.com/coreos/etcd/server/peer_server.go:299 +0x3c3

goroutine 17 [sleep]:
time.Sleep(0xb2d05e00)
/home/roy/go/src/pkg/runtime/time.goc:31 +0x31
github.com/coreos/etcd/server.(*PeerServer).monitorSnapshot(0xc21000a0f0)
/home/roy/etcd/gopath/src/github.com/coreos/etcd/server/peer_server.go:703 +0x31
created by github.com/coreos/etcd/server.(*PeerServer).Start
/home/roy/etcd/gopath/src/github.com/coreos/etcd/server/peer_server.go:303 +0x3e7

goroutine 28 [select]:
net/http.(*persistConn).roundTrip(0xc210114c80, 0xc21022d4a0, 0xc210114c80, 0x0, 0x0)
/home/roy/go/src/pkg/net/http/transport.go:879 +0x6d6
net/http.(*Transport).RoundTrip(0xc210062a00, 0xc2101d38f0, 0xc21022c400, 0x0, 0x0)
/home/roy/go/src/pkg/net/http/transport.go:187 +0x391
github.com/coreos/etcd/third_party/github.com/mreiferson/go-httpclient.(*Transport).RoundTrip(0xc21005b660, 0xc2101d38f0, 0x1, 0x0, 0x0)
net/http.send(0xc2101d38f0, 0x7f1de6161ec0, 0xc21005b660, 0x0, 0x0, ...)
/home/roy/go/src/pkg/net/http/client.go:168 +0x37f
net/http.(*Client).send(0xc2100a67e0, 0xc2101d38f0, 0x26, 0xc21022e780, 0xc21022d490)
/home/roy/go/src/pkg/net/http/client.go:100 +0xd9
net/http.(*Client).doFollowingRedirects(0xc2100a67e0, 0xc2101d38f0, 0x94af68, 0x0, 0x0, ...)
/home/roy/go/src/pkg/net/http/client.go:294 +0x671
net/http.(*Client).Do(0xc2100a67e0, 0xc2101d38f0, 0xc21022e750, 0x0, 0x0)
/home/roy/go/src/pkg/net/http/client.go:132 +0x157
github.com/coreos/etcd/server.(*transporter).Post(0xc2100a8100, 0xc21022e750, 0x26, 0x7f1de61627a0, 0xc2101fd540, ...)
/home/roy/etcd/gopath/src/github.com/coreos/etcd/server/transporter.go:235 +0x82
github.com/coreos/etcd/server.(*transporter).SendAppendEntriesRequest(0xc2100a8100, 0x7f1de6161f58, 0xc2100577e0, 0xc2100d8850, 0xc2101a00f0, ...)
/home/roy/etcd/gopath/src/github.com/coreos/etcd/server/transporter.go:108 +0x530
github.com/coreos/etcd/third_party/github.com/goraft/raft.(*Peer).sendAppendEntriesRequest(0xc2100d8850, 0xc2101a00f0)
github.com/coreos/etcd/third_party/github.com/goraft/raft.(*Peer).heartbeat(0xc2100d8850, 0xc21011c480)
created by github.com/coreos/etcd/third_party/github.com/goraft/raft.(*Peer).startHeartbeat

goroutine 560 [select]:
github.com/coreos/etcd/third_party/github.com/goraft/raft.(*server).send(0xc2100577e0, 0x7e6bc0, 0xc2101a0140, 0x747, 0xecaea2d15, ...)
github.com/coreos/etcd/third_party/github.com/goraft/raft.(*server).AppendEntries(0xc2100577e0, 0xc2101a0140, 0xf)
github.com/coreos/etcd/server.(*PeerServer).AppendEntriesHttpHandler(0xc21000a0f0, 0x7f1de6162960, 0xc2100c7640, 0xc2101d39c0)
github.com/coreos/etcd/server.*PeerServer.AppendEntriesHttpHandler·fm(0x7f1de6162960, 0xc2100c7640, 0xc2101d39c0)
/home/roy/etcd/gopath/src/github.com/coreos/etcd/server/peer_server.go:333 +0x44
net/http.HandlerFunc.ServeHTTP(0xc2100c9c90, 0x7f1de6162960, 0xc2100c7640, 0xc2101d39c0)
/home/roy/go/src/pkg/net/http/server.go:1220 +0x40
github.com/coreos/etcd/third_party/github.com/gorilla/mux.(*Router).ServeHTTP(0xc2100b2eb0, 0x7f1de6162960, 0xc2100c7640, 0xc2101d39c0)
github.com/coreos/etcd/http.(*CORSHandler).ServeHTTP(0xc210050e80, 0x7f1de6162960, 0xc2100c7640, 0xc2101d39c0)
/home/roy/etcd/gopath/src/github.com/coreos/etcd/http/cors.go:74 +0x1a0
net/http.serverHandler.ServeHTTP(0xc2100b2af0, 0x7f1de6162960, 0xc2100c7640, 0xc2101d39c0)
/home/roy/go/src/pkg/net/http/server.go:1597 +0x16e
net/http.(*conn).serve(0xc210182800)
/home/roy/go/src/pkg/net/http/server.go:1167 +0x7b7
created by net/http.(*Server).Serve
/home/roy/go/src/pkg/net/http/server.go:1644 +0x28b

goroutine 25 [runnable]:
net.runtime_pollWait(0x7f1de6161258, 0x72, 0x0)
/home/roy/go/src/pkg/runtime/netpoll.goc:116 +0x6a
net.(*pollDesc).Wait(0xc2100d8530, 0x72, 0x7f1de61600e8, 0xb)
/home/roy/go/src/pkg/net/fd_poll_runtime.go:81 +0x34
net.(*pollDesc).WaitRead(0xc2100d8530, 0xb, 0x7f1de61600e8)
/home/roy/go/src/pkg/net/fd_poll_runtime.go:86 +0x30
net.(*netFD).Read(0xc2100d84d0, 0xc2101ca000, 0x1000, 0x1000, 0x0, ...)
/home/roy/go/src/pkg/net/fd_unix.go:204 +0x2a0
net.(*conn).Read(0xc2100e9e88, 0xc2101ca000, 0x1000, 0x1000, 0x30, ...)
/home/roy/go/src/pkg/net/net.go:122 +0xc5
bufio.(*Reader).fill(0xc21011c2a0)
/home/roy/go/src/pkg/bufio/bufio.go:91 +0x110
bufio.(*Reader).Peek(0xc21011c2a0, 0x1, 0x0, 0x0, 0x0, ...)
/home/roy/go/src/pkg/bufio/bufio.go:119 +0xcb
net/http.(*persistConn).readLoop(0xc210114c80)
/home/roy/go/src/pkg/net/http/transport.go:687 +0xb7
created by net/http.(*Transport).dialConn
/home/roy/go/src/pkg/net/http/transport.go:528 +0x607

goroutine 561 [IO wait]:
net.runtime_pollWait(0x7f1de6161108, 0x72, 0x0)
/home/roy/go/src/pkg/runtime/netpoll.goc:116 +0x6a
net.(*pollDesc).Wait(0xc2101fd7d0, 0x72, 0x7f1de61600e8, 0xb)
/home/roy/go/src/pkg/net/fd_poll_runtime.go:81 +0x34
net.(*pollDesc).WaitRead(0xc2101fd7d0, 0xb, 0x7f1de61600e8)
/home/roy/go/src/pkg/net/fd_poll_runtime.go:86 +0x30
net.(*netFD).Read(0xc2101fd770, 0xc210231000, 0x1000, 0x1000, 0x0, ...)
/home/roy/go/src/pkg/net/fd_unix.go:204 +0x2a0
net.(*conn).Read(0xc21013c940, 0xc210231000, 0x1000, 0x1000, 0x48fcfa, ...)
/home/roy/go/src/pkg/net/net.go:122 +0xc5
bufio.(*Reader).fill(0xc21016bf00)
/home/roy/go/src/pkg/bufio/bufio.go:91 +0x110
bufio.(*Reader).Peek(0xc21016bf00, 0x1, 0xc210089360, 0x11014bba0, 0x404cd0, ...)
/home/roy/go/src/pkg/bufio/bufio.go:119 +0xcb
net/http.(*persistConn).readLoop(0xc210182880)
/home/roy/go/src/pkg/net/http/transport.go:687 +0xb7
created by net/http.(*Transport).dialConn
/home/roy/go/src/pkg/net/http/transport.go:528 +0x607

goroutine 562 [select]:
net/http.(*persistConn).writeLoop(0xc210182880)
/home/roy/go/src/pkg/net/http/transport.go:791 +0x271
created by net/http.(*Transport).dialConn
/home/roy/go/src/pkg/net/http/transport.go:529 +0x61e



--
You received this message because you are subscribed to the Google Groups "etcd-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to etcd-dev+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


---
Roy Smith



Yicheng Qin

unread,
Apr 23, 2014, 6:41:51 PM4/23/14
to Roy Smith, Brandon Philips, etcd...@googlegroups.com
```

[etcd] Apr 23 21:52:21.894 INFO      | Rejoin with http://127.0.0.1:7001 instead of http://etc1.songza.com:7001 from etc1.songza.com
panic: assertion failed: leader.elected.at.same.term.1
```
I assume that you don't restart etc1.
Some other etcd runs on your machine sends the join command to this cluster. And it has the name `etc1.songza.com` and peer address `http://127.0.0.1:7001`. And it should has some log before, so raft think this is a conflict.
Could you kill all distrubing etcds first? We have not implemented the functionality to prevent collision, but it is in our plan list.

Yicheng

Roy Smith

unread,
Apr 23, 2014, 6:48:32 PM4/23/14
to Yicheng Qin, Brandon Philips, etcd...@googlegroups.com
I don't understand this.

I stopped all the etcd processes on all three machines, moved their data directories away, and restarted everything from scratch.  Maybe I just don't understand the process to start up a new cluster from scratch?  I ran these three scripts:

on etc1:
./bin/etcd \
    -addr etc1.songza.com:4001 \
    -bind-addr 0.0.0.0 \
    -peer-addr etc1.songza.com:7001 \
    -peer-bind-addr 0.0.0.0 \
    -data-dir /home/etcd/data/etc1.songza.com \
    -name etc1.songza.com

on etc2:
./bin/etcd \
    -addr etc2.songza.com:4001 \
    -bind-addr 0.0.0.0 \
    -peer-addr etc2.songza.com:7001 \
    -peer-bind-addr 0.0.0.0 \
    -data-dir data/etc2.songza.com \
    -name etc2.songza.com

on etc3:
./bin/etcd \
    -addr etc3.songza.com:4001 \
    -bind-addr 0.0.0.0 \
    -peer-addr etc3.songza.com:7001 \
    -peer-bind-addr 0.0.0.0 \
    -data-dir data/etc3.songza.com \
    -name etc3.songza.com

Is that not the right procedure?

Yicheng Qin

unread,
Apr 23, 2014, 7:16:43 PM4/23/14
to Roy Smith, Brandon Philips, etcd...@googlegroups.com
This is definitely correct.
If you want to restart etc1 and make it rejoin the old cluster, the easiest way is to use the same command line as the original one. Or it may make etcd confused.

Yicheng Qin

unread,
Apr 23, 2014, 7:29:00 PM4/23/14
to Roy Smith, Brandon Philips, etcd...@googlegroups.com
The story comes here is that etc1 starts without old data directory, and no peers. So it thinks that it should start a brand new cluster. After it starts successfully, etc3 sends heartbeat to it and tries to sync them because etc1 uses old peer address. But they are different cluster now! It is critical to merge two independent cluster, so it fails.

Yicheng Qin

unread,
Apr 23, 2014, 7:33:15 PM4/23/14
to Roy Smith, Brandon Philips, etcd...@googlegroups.com
You could always restart with the same command line, and it should always work.
Moreover, you could change its peer address when restart. This is a new feature in master, so user could migrate etcd to different machine easily.
It also could rejoin the cluster without data dir, but you have to run etcd with discovery URL or `-peers`. This gives it the way to find the old cluster.
If you just run `./etcd`, it uses 127.0.0.1:4001 as addr, 127.0.0.1:7001 as peer-addr, some default dir as data dir, some default name as name. You could restart it without argument, and all arguments are set to be the default one.

Roy Smith

unread,
Apr 25, 2014, 4:56:00 PM4/25/14
to Yicheng Qin, Brandon Philips, etcd...@googlegroups.com
Things are not looking good here.  I started up all three servers yesterday, using the original commands.  Everything seemed to be OK.  All three came up, and ran for a day.

Just now, I shut down two of the servers, and restarted them.  Now they won't come back up.  Etc1 appears to be running, but when I try to query it, I get:

{
   "index" : 0,
   "errorCode" : 402,
   "message" : "Standby Internal Error"
}


Etc2 just refuses to start.  When I run my startup script, I get:

$ sh go
[etcd] Apr 25 20:52:23.017 INFO      | etcd server [name etc2.songza.com, listen on 0.0.0.0:4001, advertised url http://etc2.songza.com:4001]
[etcd] Apr 25 20:52:23.031 INFO      | etc2.songza.com: peer added: 'etc3.songza.com'
[etcd] Apr 25 20:52:23.038 INFO      | etc2.songza.com: peer added: 'etc1.songza.com'
[etcd] Apr 25 20:52:23.218 INFO      | URLs: /_etcd/machines:  / etc2.songza.com (http://etc1.songza.com:7001,http://etc2.songza.com:7001,http://etc3.songza.com:7001)
[etcd] Apr 25 20:52:23.227 INFO      | Send Join Request to http://etc1.songza.com:7001/v2/admin/machines/etc2.songza.com
[etcd] Apr 25 20:52:23.229 INFO      | »»»» 307
[etcd] Apr 25 20:52:23.232 WARNING   | Attempt to join via etc1.songza.com:7001 failed: Unable to join: Put http://etc2.songza.com:7001/v2/admin/machines/etc2.songza.com: dial tcp 10.152.165.210:7001: connection refused
[etcd] Apr 25 20:52:23.237 INFO      | Send Join Request to http://etc3.songza.com:7001/v2/admin/machines/etc2.songza.com
[etcd] Apr 25 20:52:23.587 WARNING   | Attempt to join via etc3.songza.com:7001 failed: Unable to join: Put http://etc3.songza.com:7001/v2/admin/machines/etc2.songza.com: read tcp 10.122.34.52:7001: i/o timeout
[etcd] Apr 25 20:52:23.592 INFO      | Send Join Request to http://etc1.songza.com:7001/v2/admin/machines/etc2.songza.com
[etcd] Apr 25 20:52:23.595 INFO      | »»»» 307
[etcd] Apr 25 20:52:23.596 WARNING   | Attempt to join via etc1.songza.com:7001 failed: Unable to join: Put http://etc2.songza.com:7001/v2/admin/machines/etc2.songza.com: dial tcp 10.152.165.210:7001: connection refused
[etcd] Apr 25 20:52:23.598 INFO      | Send Join Request to http://etc3.songza.com:7001/v2/admin/machines/etc2.songza.com
[etcd] Apr 25 20:52:23.949 WARNING   | Attempt to join via etc3.songza.com:7001 failed: Unable to join: Put http://etc3.songza.com:7001/v2/admin/machines/etc2.songza.com: read tcp 10.122.34.52:7001: i/o timeout
[etcd] Apr 25 20:52:23.949 WARNING   | No living peers are found!
[etcd] Apr 25 20:52:23.949 INFO      | etc2.songza.com: state changed from 'initialized' to 'follower'.
[etcd] Apr 25 20:52:23.949 INFO      | peer server [name etc2.songza.com, listen on 0.0.0.0:7001, advertised url http://etc2.songza.com:7001]
[etcd] Apr 25 20:52:24.046 INFO      | Demote peer etc2.songza.com: Set mode to proxy with etc3.songza.com
[etcd] Apr 25 20:52:24.055 INFO      | etc2.songza.com: peer removed: 'etc1.songza.com'
[etcd] Apr 25 20:52:24.063 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:24.066 INFO      | etc2.songza.com: peer added: 'etc1.songza.com'
[etcd] Apr 25 20:52:24.079 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:24.085 INFO      | etc2.songza.com: state changed from 'follower' to 'stopped'.
[etcd] Apr 25 20:52:24.085 INFO      | etc2.songza.com: state changed from 'stopped' to 'stopped'.
[etcd] Apr 25 20:52:24.106 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:24.154 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:24.206 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:24.254 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:24.304 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:24.354 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:24.409 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:24.485 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:24.506 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:24.554 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:24.606 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:24.654 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:24.704 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:24.754 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:24.804 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:24.854 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:24.906 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:24.954 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:25.006 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:25.054 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:25.106 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:25.154 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:25.206 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:25.254 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:25.306 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:25.354 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:25.406 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:25.454 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:25.506 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:25.554 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:25.606 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:25.658 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:25.704 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:25.754 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:25.804 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:25.854 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:25.906 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:25.954 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:26.004 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:26.062 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:26.104 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:26.154 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:26.204 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:26.254 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:26.306 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:26.354 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:26.406 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:26.454 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:26.504 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:26.554 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:26.604 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:26.654 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:26.706 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:26.754 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:26.826 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:26.856 WARNING   | [ae] Error: nil response
[etcd] Apr 25 20:52:26.904 WARNING   | [ae] Error: nil response
panic: runtime error: index out of range

goroutine 33 [running]:
runtime.panic(0x7df800, 0xe18eb7)
/home/roy/go/src/pkg/runtime/panic.c:266 +0xb6
github.com/coreos/etcd/third_party/github.com/goraft/raft.(*Log).commitInfo(0xc2100527e0, 0x0, 0x0)
github.com/coreos/etcd/third_party/github.com/goraft/raft.(*server).TakeSnapshot(0xc210057240, 0x523c9, 0x0)
github.com/coreos/etcd/server.(*PeerServer).monitorSnapshot(0xc21000a0f0)
created by github.com/coreos/etcd/server.(*PeerServer).Start

goroutine 1 [IO wait]:
net.runtime_pollWait(0x7f3ef76b7300, 0x72, 0x0)
/home/roy/go/src/pkg/runtime/netpoll.goc:116 +0x6a
net.(*pollDesc).Wait(0xc210052990, 0x72, 0x7f3ef76b60e8, 0xb)
/home/roy/go/src/pkg/net/fd_poll_runtime.go:81 +0x34
net.(*pollDesc).WaitRead(0xc210052990, 0xb, 0x7f3ef76b60e8)
/home/roy/go/src/pkg/net/fd_poll_runtime.go:86 +0x30
net.(*netFD).accept(0xc210052930, 0x942778, 0x0, 0x7f3ef76b60e8, 0xb)
/home/roy/go/src/pkg/net/fd_unix.go:382 +0x2c2
net.(*TCPListener).AcceptTCP(0xc210000468, 0x18, 0xc210042010, 0x513993)
/home/roy/go/src/pkg/net/tcpsock_posix.go:233 +0x47
net.(*TCPListener).Accept(0xc210000468, 0x50, 0xe2bb00, 0x18, 0x0)
/home/roy/go/src/pkg/net/tcpsock_posix.go:243 +0x27
net/http.(*Server).Serve(0xc21001fe60, 0x7f3ef76b81d0, 0xc210000468, 0x0, 0x0)
/home/roy/go/src/pkg/net/http/server.go:1622 +0x91
net/http.Serve(0x7f3ef76b81d0, 0xc210000468, 0x7f3ef76b87d0, 0xc2100b5ce0, 0x0, ...)
/home/roy/go/src/pkg/net/http/server.go:1561 +0x70
github.com/coreos/etcd/etcd.(*Etcd).Run(0xc21005a540)
/home/roy/etcd/src/github.com/coreos/etcd/etcd/etcd.go:213 +0x1d8f
main.main()
/home/roy/etcd/src/github.com/coreos/etcd/main.go:43 +0x2b2

goroutine 4 [syscall]:
os/signal.loop()
/home/roy/go/src/pkg/os/signal/signal_unix.go:21 +0x1e
created by os/signal.init·1
/home/roy/go/src/pkg/os/signal/signal_unix.go:27 +0x31

goroutine 5 [IO wait]:
net.runtime_pollWait(0x7f3ef76b7258, 0x72, 0x0)
/home/roy/go/src/pkg/runtime/netpoll.goc:116 +0x6a
net.(*pollDesc).Wait(0xc21020d370, 0x72, 0x7f3ef76b60e8, 0xb)
/home/roy/go/src/pkg/net/fd_poll_runtime.go:81 +0x34
net.(*pollDesc).WaitRead(0xc21020d370, 0xb, 0x7f3ef76b60e8)
/home/roy/go/src/pkg/net/fd_poll_runtime.go:86 +0x30
net.(*netFD).accept(0xc21020d310, 0x942778, 0x0, 0x7f3ef76b60e8, 0xb)
/home/roy/go/src/pkg/net/fd_unix.go:382 +0x2c2
net.(*TCPListener).AcceptTCP(0xc210273618, 0x513bbb, 0x7f3ef752ede0, 0x513bbb)
/home/roy/go/src/pkg/net/tcpsock_posix.go:233 +0x47
net.(*TCPListener).Accept(0xc210273618, 0x7f3ef76c5de8, 0xc210273b10, 0xc21023f800, 0x0)
/home/roy/go/src/pkg/net/tcpsock_posix.go:243 +0x27
net/http.(*Server).Serve(0xc210310960, 0x7f3ef76b81d0, 0xc210273618, 0x0, 0x0)
/home/roy/go/src/pkg/net/http/server.go:1622 +0x91
net/http.Serve(0x7f3ef76b81d0, 0xc210273618, 0x7f3ef76b87d0, 0xc210318480, 0x0, ...)
/home/roy/go/src/pkg/net/http/server.go:1561 +0x70
/home/roy/etcd/src/github.com/coreos/etcd/etcd/etcd.go:204 +0x345
created by github.com/coreos/etcd/etcd.(*Etcd).Run
/home/roy/etcd/src/github.com/coreos/etcd/etcd/etcd.go:210 +0x1c8e

goroutine 34 [runnable]:
net.runtime_pollWait(0x7f3ef76b71b0, 0x72, 0x0)
/home/roy/go/src/pkg/runtime/netpoll.goc:116 +0x6a
net.(*pollDesc).Wait(0xc2101c3760, 0x72, 0x7f3ef76b60e8, 0xb)
/home/roy/go/src/pkg/net/fd_poll_runtime.go:81 +0x34
net.(*pollDesc).WaitRead(0xc2101c3760, 0xb, 0x7f3ef76b60e8)
/home/roy/go/src/pkg/net/fd_poll_runtime.go:86 +0x30
net.(*netFD).Read(0xc2101c3700, 0xc21031c000, 0x1000, 0x1000, 0x0, ...)
/home/roy/go/src/pkg/net/fd_unix.go:204 +0x2a0
net.(*conn).Read(0xc210273b10, 0xc21031c000, 0x1000, 0x1000, 0x0, ...)
/home/roy/go/src/pkg/net/net.go:122 +0xc5
net/http.(*liveSwitchReader).Read(0xc21023f828, 0xc21031c000, 0x1000, 0x1000, 0x42bea2, ...)
/home/roy/go/src/pkg/net/http/server.go:204 +0xa5
io.(*LimitedReader).Read(0xc2103184c0, 0xc21031c000, 0x1000, 0x1000, 0xc2100624e0, ...)
/home/roy/go/src/pkg/io/io.go:398 +0xbb
bufio.(*Reader).fill(0xc210312cc0)
/home/roy/go/src/pkg/bufio/bufio.go:91 +0x110
bufio.(*Reader).ReadSlice(0xc210312cc0, 0xc21030660a, 0x0, 0x0, 0x0, ...)
/home/roy/go/src/pkg/bufio/bufio.go:274 +0x204
bufio.(*Reader).ReadLine(0xc210312cc0, 0x0, 0x0, 0x0, 0x0, ...)
/home/roy/go/src/pkg/bufio/bufio.go:305 +0x63
net/textproto.(*Reader).readLineSlice(0xc21027c5d0, 0x7f3ef76ad000, 0x71f980, 0x7f3ef74f6ce8, 0x421a92, ...)
/home/roy/go/src/pkg/net/textproto/reader.go:55 +0x61
net/textproto.(*Reader).ReadLine(0xc21027c5d0, 0xc2103079c0, 0x0, 0xc21031d000, 0x0)
/home/roy/go/src/pkg/net/textproto/reader.go:36 +0x27
net/http.ReadRequest(0xc210312cc0, 0xc2103079c0, 0x0, 0x0)
/home/roy/go/src/pkg/net/http/request.go:526 +0x88
net/http.(*conn).readRequest(0xc21023f800, 0x0, 0x0, 0x0)
/home/roy/go/src/pkg/net/http/server.go:575 +0x1bb
net/http.(*conn).serve(0xc21023f800)
/home/roy/go/src/pkg/net/http/server.go:1123 +0x3b4
created by net/http.(*Server).Serve
/home/roy/go/src/pkg/net/http/server.go:1644 +0x28b

goroutine 9 [syscall]:
runtime.goexit()
/home/roy/go/src/pkg/runtime/proc.c:1394

goroutine 29 [chan receive]:
github.com/coreos/etcd/server.(*PeerServer).monitorSync(0xc21000a0f0)
created by github.com/coreos/etcd/server.(*PeerServer).Start

goroutine 31 [select]:
github.com/coreos/etcd/server.(*PeerServer).monitorActiveSize(0xc21000a0f0, 0xc2100b7c00)
created by github.com/coreos/etcd/server.(*PeerServer).Start

goroutine 30 [select]:
github.com/coreos/etcd/server.(*PeerServer).monitorTimeoutThreshold(0xc21000a0f0, 0xc2100b7c00)
created by github.com/coreos/etcd/server.(*PeerServer).Start

goroutine 32 [select]:
github.com/coreos/etcd/server.(*PeerServer).monitorPeerActivity(0xc21000a0f0, 0xc2100b7c00)
created by github.com/coreos/etcd/server.(*PeerServer).Start

--
Roy Smith



Yicheng Qin

unread,
Apr 25, 2014, 10:41:20 PM4/25/14
to Roy Smith, Brandon Philips, etcd...@googlegroups.com
I think you have also played with proxy mode!
We also find some bugs in it and working hard to fix it.
Could you give out the details about how this is generated?

Yicheng

Roy Smith

unread,
Apr 25, 2014, 11:07:46 PM4/25/14
to Yicheng Qin, Brandon Philips, etcd...@googlegroups.com
I haven't done anything with proxies (at least not intentionally).  All I've done is start up the three servers with the commands I mentioned earlier:
Initially, I started them by running those scripts manually, but then I moved those commands into upstart scripts on the various machines, and manage bringing the processes up and down with the "service" command.

After everything had been running for a day, I stopped two of the servers (etc1 and etc2) by doing "service etcd stop".  I then tried to start them back up again by doing "service etcd start".

Yicheng Qin

unread,
Apr 26, 2014, 3:39:59 PM4/26/14
to Roy Smith, Brandon Philips, etcd...@googlegroups.com
It looks to me as the standby mode breaks the cluster.
We are making some fixes regarding the standby mode, but I don't know whether or not it is related.
The '/v2/stats/self' error is a bug, and I am fixing on it now. But etc1 should not be in standby mode, and etc2 should not be demoted.
Could you show me the conf and log file of etc3 under its data dir?
What is about the result of `curl -L http://etc3.songza.com:7001/v2/admin/config` and `curl -L http://etc3.songza.com:7001/v2/admin/machines`?

Thanks!

Roy Smith

unread,
Apr 27, 2014, 9:16:08 AM4/27/14
to Yicheng Qin, Brandon Philips, etcd...@googlegroups.com
Tarball attached with conf and log files from all three machines.


{
   "promoteDelay" : 1800,
   "activeSize" : 9
}



[
   {
      "clientURL" : "http://etc1.songza.com:4001",
      "mode" : "peer",
      "name" : "etc1.songza.com",
      "peerURL" : "http://etc1.songza.com:7001"
   },
   {
      "clientURL" : "http://etc2.songza.com:4001",
      "mode" : "peer",
      "name" : "etc2.songza.com",
      "peerURL" : "http://etc2.songza.com:7001"
   },
   {
      "clientURL" : "http://etc3.songza.com:4001",
      "mode" : "peer",
      "name" : "etc3.songza.com",
      "peerURL" : "http://etc3.songza.com:7001"
   }
]







data.tar.gz

Yicheng Qin

unread,
Apr 30, 2014, 11:09:43 AM4/30/14
to Roy Smith, Brandon Philips, etcd...@googlegroups.com
Hi Roy,

I got the idea of the bug finally!
It is caused by our new functionality, standby mode. I have reported the bug here: https://github.com/coreos/etcd/issues/761
The cluster will demote the node that has been disconnected from the cluster for 30 mins to standby mode, and this is the reason why there's demote command.
Currently, the replay of demote command in the log is incorrect, which prevents old node rejoin the cluster.
I am working on this to make it work: https://github.com/coreos/etcd/pull/764
Could you wait a week or so for the bug-fix HEAD?




On Apr 26, 2014, at 3:39 PM, Yicheng Qin wrote:

It looks to me as the standby mode breaks the cluster.
We are making some fixes regarding the standby mode, but I don't know whether or not it is related.
The '/v2/stats/self' error is a bug, and I am fixing on it now. But etc1 should not be in standby mode, and etc2 should not be demoted.
Could you show me the conf and log file of etc3 under its data dir?
What is about the result of `curl -L http://etc3.songza.com:7001/v2/admin/config` and `curl -L http://etc3.songza.com:7001/v2/admin/machines`?



--
Roy Smith





Roy Smith

unread,
Apr 30, 2014, 11:14:27 AM4/30/14
to Yicheng Qin, Brandon Philips, etcd...@googlegroups.com
Thanks for figuring this out.  No problem on waiting a week; we're just exploring etcd as a new technology, so nothing we're doing is blocked by this.  I don't follow the github repo that closely; could you ping me when you've got a fix?



--
You received this message because you are subscribed to the Google Groups "etcd-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to etcd-dev+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


---
Roy Smith



Yicheng Qin

unread,
Apr 30, 2014, 11:15:31 AM4/30/14
to Roy Smith, Brandon Philips, etcd...@googlegroups.com
Definitely.

Roy Smith

unread,
May 21, 2014, 8:34:58 AM5/21/14
to Yicheng Qin, Brandon Philips, etcd...@googlegroups.com
I assume the new 0.4.0 release should fix this?


On Apr 30, 2014, at 11:09 AM, Yicheng Qin wrote:

--
You received this message because you are subscribed to the Google Groups "etcd-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to etcd-dev+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
Roy Smith



Brandon Philips

unread,
May 21, 2014, 8:47:23 AM5/21/14
to Roy Smith, Yicheng Qin, etcd...@googlegroups.com
On Wed, May 21, 2014 at 5:34 AM, Roy Smith <r...@panix.com> wrote:
> I assume the new 0.4.0 release should fix this?

Yes, it should. We changed the way standby works a bit since then so
please let us know if you encounter any problems.

Thanks,

Brandon

Roy Smith

unread,
May 21, 2014, 10:40:33 AM5/21/14
to Brandon Philips, Yicheng Qin, etcd...@googlegroups.com
Cool. I'll upgrade my servers and let you know how it goes.
> --
> You received this message because you are subscribed to the Google Groups "etcd-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to etcd-dev+u...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>


---
Roy Smith
r...@panix.com



Roy Smith

unread,
May 21, 2014, 11:13:04 AM5/21/14
to Brandon Philips, Yicheng Qin, etcd...@googlegroups.com
Well, initial results are good. I copied the new binary into place on my three servers and they all restarted. One curiosity, though: The logs say:


[etcd] May 21 14:58:09.187 INFO | etc1.songza.com: peer removed: 'etc3.songza.com'
[etcd] May 21 14:58:09.193 INFO | etc1.songza.com: peer added: 'etc3.songza.com'
[etcd] May 21 14:58:09.466 INFO | etc1.songza.com: term #319 started.
[etcd] May 21 14:58:09.466 INFO | etc1.songza.com: leader changed from 'etc2.songza.com' to ''.
[etcd] May 21 14:58:09.664 INFO | etc1.songza.com: state changed from 'follower' to 'candidate'.
[etcd] May 21 14:58:09.667 INFO | etc1.songza.com: state changed from 'candidate' to 'leader'.
[etcd] May 21 14:58:09.667 INFO | etc1.songza.com: leader changed from '' to 'etc1.songza.com'.


[etcd] May 21 14:58:09.132 INFO | etc2.songza.com: peer removed: 'etc3.songza.com'
[etcd] May 21 14:58:09.148 INFO | etc2.songza.com: peer added: 'etc3.songza.com'
[etcd] May 21 14:58:09.466 INFO | etc2.songza.com: state changed from 'leader' to 'follower'.
[etcd] May 21 14:58:09.466 INFO | etc2.songza.com: term #319 started.
[etcd] May 21 14:58:09.466 INFO | etc2.songza.com: leader changed from 'etc2.songza.com' to ''.
[etcd] May 21 14:58:09.666 INFO | etc2.songza.com: term #320 started.


[etcd] May 21 14:58:09.465 INFO | etc3.songza.com: state changed from 'follower' to 'candidate'.
[etcd] May 21 14:58:09.465 INFO | etc3.songza.com: leader changed from 'etc2.songza.com' to ''.
[etcd] May 21 14:58:09.671 INFO | etc3.songza.com: state changed from 'candidate' to 'follower'.
[etcd] May 21 14:58:09.671 INFO | etc3.songza.com: term #320 started.



What does it mean when the leader is an empty string, as in:

[etcd] May 21 14:58:09.466 INFO | etc1.songza.com: leader changed from 'etc2.songza.com' to ''.

Is that normal?

Yicheng Qin

unread,
May 21, 2014, 2:12:34 PM5/21/14
to Roy Smith, Brandon Philips, etcd...@googlegroups.com
On Wed, May 21, 2014 at 8:13 AM, Roy Smith <r...@panix.com> wrote:
Well, initial results are good.  I copied the new binary into place on my three servers and they all restarted.  One curiosity, though:  The logs say:


[etcd] May 21 14:58:09.187 INFO      | etc1.songza.com: peer removed: 'etc3.songza.com'
[etcd] May 21 14:58:09.193 INFO      | etc1.songza.com: peer added: 'etc3.songza.com'
[etcd] May 21 14:58:09.466 INFO      | etc1.songza.com: term #319 started.
[etcd] May 21 14:58:09.466 INFO      | etc1.songza.com: leader changed from 'etc2.songza.com' to ''.
[etcd] May 21 14:58:09.664 INFO      | etc1.songza.com: state changed from 'follower' to 'candidate'.
[etcd] May 21 14:58:09.667 INFO      | etc1.songza.com: state changed from 'candidate' to 'leader'.
[etcd] May 21 14:58:09.667 INFO      | etc1.songza.com: leader changed from '' to 'etc1.songza.com'.


[etcd] May 21 14:58:09.132 INFO      | etc2.songza.com: peer removed: 'etc3.songza.com'
[etcd] May 21 14:58:09.148 INFO      | etc2.songza.com: peer added: 'etc3.songza.com'
[etcd] May 21 14:58:09.466 INFO      | etc2.songza.com: state changed from 'leader' to 'follower'.
[etcd] May 21 14:58:09.466 INFO      | etc2.songza.com: term #319 started.
[etcd] May 21 14:58:09.466 INFO      | etc2.songza.com: leader changed from 'etc2.songza.com' to ''.
[etcd] May 21 14:58:09.666 INFO      | etc2.songza.com: term #320 started.


[etcd] May 21 14:58:09.465 INFO      | etc3.songza.com: state changed from 'follower' to 'candidate'.
[etcd] May 21 14:58:09.465 INFO      | etc3.songza.com: leader changed from 'etc2.songza.com' to ''.
[etcd] May 21 14:58:09.671 INFO      | etc3.songza.com: state changed from 'candidate' to 'follower'.
[etcd] May 21 14:58:09.671 INFO      | etc3.songza.com: term #320 started.



What does it mean when the leader is an empty string, as in:

[etcd] May 21 14:58:09.466 INFO      | etc1.songza.com: leader changed from 'etc2.songza.com' to ''.

Is that normal?
Yes.
The sentence indicates that it comes to the period of leader election, and there's no leader for the cluster now.
`[etcd] May 21 14:58:09.667 INFO      | etc1.songza.com: leader changed from '' to 'etc1.songza.com'.`
This line says that etc1 is voted to be the leader during the election.

Thanks,
Yicheng
Reply all
Reply to author
Forward
0 new messages