docker failing to pick up flannel settings

1,017 views
Skip to first unread message

Brian Rosner

unread,
Mar 18, 2015, 1:05:12 PM3/18/15
to coreo...@googlegroups.com
I am trying to understand why a particular user-data file causes docker to start without the flannel environment. Here is my user-data script:

#cloud-config
coreos:
  etcd:
    discovery: {{generateDiscoveryUrl}}
    addr: $private_ipv4:4001
    peer-addr: $private_ipv4:7001
    peer-election-timeout: 2000
    peer-heartbeat-interval: 500
  fleet:
    public-ip: $private_ipv4
    etcd_request_timeout: 3.0
    metadata: "{{fleetMetadata}}"
  flannel:
    interface: $private_ipv4
  units:
    - name: flanneld.service
      command: start
      drop-ins:
        - name: 50-network-config.conf
          content: |
            [Service]
            ExecStartPre=/usr/bin/etcdctl set /coreos.com/network/config '{ "Network": "172.30.0.0/16" }'
    - name: etcd.service
      command: start
    - name: fleet.socket
      command: start
      drop-ins:
      - name: 30-ListenStream.conf
        content: |
          [Socket]
          ListenStream=127.0.0.1:49153
    - name: fleet.service
      command: start
    - name: discoverd.service
      command: start
      content: |
        [Unit]
        Description=discoverd
        After=docker.service

        [Service]
        User=core
        ExecStartPre=/usr/bin/docker pull flynn/discoverd
        ExecStart=/usr/bin/docker run --rm --name=discoverd --net=host flynn/discoverd -etcd http://127.0.0.1:4001
        ExecStop=/usr/bin/docker stop discoverd
        Restart=always
        RestartSec=10s
        TimeoutStartSec=120
        TimeoutStopSec=15

If I don't include the discoverd.service unit everything appears to work fine. If I boot my cluster with the user-data without the discoverd.service then run docker pull flynn/discoverd, everything works. docker is started with the variables set in /run/flannel_docker_opts.env. However, this is not the case if systemd starts discoverd. The docker.service is started, but is missing all the environment variables.

Does anyone know what I am doing wrong here?

eugene.y...@coreos.com

unread,
Mar 18, 2015, 4:06:43 PM3/18/15
to coreo...@googlegroups.com
This usually happens if docker is started before flannel. However it shouldn't as flanneld.service is listed first in cloud-config.

Can you run "journalctl -b -u flanneld.service -u discoverd.service -u docker.service" to see the interleaving and the ordering?

Brian Rosner

unread,
Mar 19, 2015, 7:16:30 AM3/19/15
to coreo...@googlegroups.com
On Wednesday, March 18, 2015 at 2:06:43 PM UTC-6, eugene.y...@coreos.com wrote:
This usually happens if docker is started before flannel. However it shouldn't as flanneld.service is listed first in cloud-config.

Can you run "journalctl -b -u flanneld.service -u discoverd.service -u docker.service" to see the interleaving and the ordering?

I suspected this was the case. It never occurred to me to look at the logs this way though. Good call. Here's what I am seeing:

Mar 19 11:13:06 core-1 systemd[1]: Starting etcd...
Mar 19 11:13:06 core-1 systemd[1]: Started etcd.
Mar 19 11:13:06 core-1 systemd[1]: Starting Network fabric for containers...
Mar 19 11:13:06 core-1 etcd[905]: [etcd] Mar 19 11:13:06.311 INFO      | Discovery via https://discovery.etcd.io using prefix /ba1fb75ca25f27451f061db127144648.
Mar 19 11:13:06 core-1 etcdctl[925]: Error:  Cannot sync with the cluster using peers 127.0.0.1:4001
Mar 19 11:13:06 core-1 systemd[1]: flanneld.service: control process exited, code=exited status=2
Mar 19 11:13:06 core-1 systemd[1]: Failed to start Network fabric for containers.
Mar 19 11:13:06 core-1 systemd[1]: Unit flanneld.service entered failed state.
Mar 19 11:13:06 core-1 systemd[1]: flanneld.service failed.
Mar 19 11:13:06 core-1 systemd[1]: Started etcd.
Mar 19 11:13:06 core-1 systemd[1]: Starting discoverd...
Mar 19 11:13:06 core-1 systemd[1]: Starting Docker Application Container Engine...
Mar 19 11:13:06 core-1 systemd[1]: Started Docker Application Container Engine.
Mar 19 11:13:06 core-1 dockerd[948]: time="2015-03-19T11:13:06Z" level="info" msg="+job serveapi(fd://)"
Mar 19 11:13:06 core-1 dockerd[948]: time="2015-03-19T11:13:06Z" level="info" msg="+job init_networkdriver()"
Mar 19 11:13:06 core-1 dockerd[948]: time="2015-03-19T11:13:06Z" level="info" msg="Listening for HTTP on fd ()"
Mar 19 11:13:06 core-1 dockerd[948]: time="2015-03-19T11:13:06Z" level="info" msg="-job init_networkdriver() = OK (0)"
Mar 19 11:13:06 core-1 dockerd[948]: time="2015-03-19T11:13:06Z" level="info" msg="Loading containers: start."
Mar 19 11:13:06 core-1 dockerd[948]: time="2015-03-19T11:13:06Z" level="info" msg="Loading containers: done."
Mar 19 11:13:06 core-1 dockerd[948]: time="2015-03-19T11:13:06Z" level="info" msg="docker daemon: 1.5.0 a8a31ef-dirty; execdriver: native-0.2; graphdriver: overlay"
Mar 19 11:13:06 core-1 dockerd[948]: time="2015-03-19T11:13:06Z" level="info" msg="+job acceptconnections()"
Mar 19 11:13:06 core-1 dockerd[948]: time="2015-03-19T11:13:06Z" level="info" msg="-job acceptconnections() = OK (0)"
Mar 19 11:13:06 core-1 dockerd[948]: time="2015-03-19T11:13:06Z" level="info" msg="POST /v1.17/images/create?fromImage=flynn%2Fdiscoverd%3Alatest"
Mar 19 11:13:06 core-1 dockerd[948]: time="2015-03-19T11:13:06Z" level="info" msg="+job pull(flynn/discoverd, latest)"
Mar 19 11:13:06 core-1 dockerd[948]: time="2015-03-19T11:13:06Z" level="info" msg="+job resolve_repository(flynn/discoverd)"
Mar 19 11:13:06 core-1 dockerd[948]: time="2015-03-19T11:13:06Z" level="info" msg="-job resolve_repository(flynn/discoverd) = OK (0)"
Mar 19 11:13:06 core-1 docker[937]: Pulling repository flynn/discoverd
Mar 19 11:13:07 core-1 etcd[905]: [etcd] Mar 19 11:13:07.019 INFO      | Discovery _state was empty, so this machine is the initial leader.
Mar 19 11:13:07 core-1 etcd[905]: [etcd] Mar 19 11:13:07.019 INFO      | Discovery fetched back peer list: []
Mar 19 11:13:07 core-1 etcd[905]: [etcd] Mar 19 11:13:07.019 INFO      | 8c067c441f0141ae95c61f4481da37af is starting a new cluster
Mar 19 11:13:07 core-1 etcd[905]: [etcd] Mar 19 11:13:07.023 INFO      | etcd server [name 8c067c441f0141ae95c61f4481da37af, listen on :4001, advertised url http://172.17.8.100:4001]
Mar 19 11:13:07 core-1 etcd[905]: [etcd] Mar 19 11:13:07.023 INFO      | peer server [name 8c067c441f0141ae95c61f4481da37af, listen on :7001, advertised url http://172.17.8.100:7001]
Mar 19 11:13:07 core-1 etcd[905]: [etcd] Mar 19 11:13:07.023 INFO      | 8c067c441f0141ae95c61f4481da37af starting in peer mode
Mar 19 11:13:07 core-1 etcd[905]: [etcd] Mar 19 11:13:07.023 INFO      | 8c067c441f0141ae95c61f4481da37af: state changed from 'initialized' to 'follower'.
Mar 19 11:13:07 core-1 etcd[905]: [etcd] Mar 19 11:13:07.023 INFO      | failed to get local etcd versions: open /usr/libexec/etcd/internal_versions/: no such file or directory
Mar 19 11:13:07 core-1 etcd[905]: [etcd] Mar 19 11:13:07.023 INFO      | 8c067c441f0141ae95c61f4481da37af: state changed from 'follower' to 'leader'.
Mar 19 11:13:07 core-1 etcd[905]: [etcd] Mar 19 11:13:07.023 INFO      | 8c067c441f0141ae95c61f4481da37af: leader changed from '' to '8c067c441f0141ae95c61f4481da37af'.
Mar 19 11:13:08 core-1 docker[937]: b9f7ecc9c143: Pulling image (latest) from flynn/discoverd
Mar 19 11:13:08 core-1 docker[937]: b9f7ecc9c143: Pulling image (latest) from flynn/discoverd, endpoint: https://registry-1.docker.io/v1/
Mar 19 11:13:08 core-1 docker[937]: b9f7ecc9c143: Pulling dependent layers
Mar 19 11:13:08 core-1 docker[937]: 511136ea3c5a: Pulling metadata
Mar 19 11:13:09 core-1 docker[937]: 511136ea3c5a: Pulling fs layer
Mar 19 11:13:10 core-1 docker[937]: 511136ea3c5a: Download complete
Mar 19 11:13:10 core-1 docker[937]: a643fa1f2742: Pulling metadata
Mar 19 11:13:11 core-1 docker[937]: a643fa1f2742: Pulling fs layer
Mar 19 11:13:11 core-1 systemd[1]: flanneld.service holdoff time over, scheduling restart.
Mar 19 11:13:11 core-1 systemd[1]: Stopping Network fabric for containers...
Mar 19 11:13:11 core-1 systemd[1]: Starting Network fabric for containers...
Mar 19 11:13:11 core-1 etcdctl[1036]: { "Network": "172.30.0.0/16" }
Mar 19 11:13:11 core-1 sdnotify-proxy[1043]: Unable to find image 'quay.io/coreos/flannel:0.3.0' locally
 
It perfectly shows the ordering and how it is wrong. Is this an issue with flannel starting too early or should I add a requirement to my discoverd to ensure flannel has started?

eugene.y...@coreos.com

unread,
Mar 19, 2015, 1:40:53 PM3/19/15
to coreo...@googlegroups.com
Hi Brian,

Here's the problem:

1. flanneld.service starts and executes ExecStartPre=etcdctl.... etcd is not fully up (it gets socket activated but it hasn't bootstrapped into a cluster). etcdctl fails and the whole unit fails. flanneld.service will get restarted later.
2. Since flanneld.service failed, it causes cloud-init to continue and ultimately start discoverd.
3. discoverd socket activates docker and it starts before flannel.

The easiest fix is probably to add Requires=flanneld.service and After=flanneld.service to discoverd.serivce. When discoverd.service will get started by cloud-config, it will fail (because required service failed) but it'll get retried and succeed on the next try or two.

For larger deployments we recommend having separate machines (3 of them) running etcd and pointing services (e.g. flannel, fleet) over to them (this can also be done via cloud-config). This way the etcdctl wouldn't fail. This method should also get easier once etcd 2.0 is shipped as it'll allow for local etcd instance to just proxy over to the real cluster.

I realized that this is a CoreOS problem. We need to better document the usage and the potential gotcha's. We also need to make flannel easier to use during startup with other services.

Best,
Eugene

Brian Rosner

unread,
Mar 21, 2015, 10:21:41 AM3/21/15
to coreo...@googlegroups.com


On Thursday, March 19, 2015 at 11:40:53 AM UTC-6, eugene.y...@coreos.com wrote:
The easiest fix is probably to add Requires=flanneld.service and After=flanneld.service to discoverd.serivce. When discoverd.service will get started by cloud-config, it will fail (because required service failed) but it'll get retried and succeed on the next try or two.

This is what I ended up doing. It is working for me.
 
For larger deployments we recommend having separate machines (3 of them) running etcd and pointing services (e.g. flannel, fleet) over to them (this can also be done via cloud-config). This way the etcdctl wouldn't fail. This method should also get easier once etcd 2.0 is shipped as it'll allow for local etcd instance to just proxy over to the real cluster.

As I plan my future clusters I am beginning to see the value in doing this. I have played around with etcd 2.0 and I agree it will be much better.
 
I realized that this is a CoreOS problem. We need to better document the usage and the potential gotcha's. We also need to make flannel easier to use during startup with other services.

Thank you for taking time to answer my question and I am looking to become much more involved in CoreOS. I would love to help out as I continue to dive deeper into the world. I am even headed to CoreOS Fest in May!
 
Thanks again!

Brian Rosner

Kevin Nguyen

unread,
Jul 27, 2015, 5:45:10 PM7/27/15
to CoreOS User, bro...@gmail.com
Hi all,

I am seeing this problem too, and do not see a discoverd.service. Is that from an older release? Also, how do we go about amending fields to the unit file? I see some examples of adding drop-ins, but not modifying a `Requires` and an `After`.

Thanks!
Message has been deleted

Kevin Nguyen

unread,
Jul 27, 2015, 8:06:15 PM7/27/15
to CoreOS User, bro...@gmail.com, ke...@collectivehealth.com
Sorry, I thought I fixed it with the new etcd2, but seems like it's still a problem. Can you guys elaborate on the fix?

Kevin Nguyen

unread,
Jul 27, 2015, 8:40:36 PM7/27/15
to CoreOS User, bro...@gmail.com, ke...@collectivehealth.com
Do you think adding

- name: docker.service
      drop-ins:
        - name: 50-flannel.conf
          content: |
            [Unit]
            Requires=flanneld.service
            After=flanneld.service
      command: start

to the cloud config would work? Is there anything that could go wrong with this approach?

Greg Fausak

unread,
Jul 28, 2015, 8:41:32 AM7/28/15
to CoreOS User, bro...@gmail.com, ke...@collectivehealth.com
It looks good.  Can you give it a try to see if it works?

My docker.service file looks like this:

[Unit]
Description=Docker Application Container Engine
Documentation=http://docs.docker.com
After=docker.socket early-docker.target network.target
Requires=docker.socket early-docker.target

[Service]
EnvironmentFile=-/run/flannel_docker_opts.env
MountFlags=slave
LimitNOFILE=1048576
LimitNPROC=1048576
ExecStart=/usr/lib/coreos/dockerd --daemon --host=fd:// $DOCKER_OPTS $DOCKER_OPT_BIP $DOCKER_OPT_MTU $DOCKER_OPT_IPMASQ

[Install]

WantedBy=multi-user.target


Note the EnvironmentFile=/run/flannel_docker_opts.env. This file is created by the flanneld.service unit. So, as long as the flanneld.service runs before the docker.service you should be in good shape. The drop in you declare acts as a modifier to the existing unit file, as if your commands were part of it.  You modify the dependency and startup order relative to docker and flannel.

Mini rant:

I personally have a real hard time with Before, After, Requires, and Wants in the unit file. I think that Requires defines a runtime dependency (if A Requires B then B must be started for A to start).  Requires does not impose any order. In fact, the manual states that if A requires B and no other modifiers are present, then they will be started at the same time.

Before and After imply order.  A After B will achieve B starting before unit A.

Here is where I get confused. I can't find a definition for 'Starting' or better 'Started' in the systemd docs. My understanding is 'starting' means that all ExecStartPre, ExecStart, and ExecStartPost are executed.  Since ExecStart is never returned from (well, except oneshot type, but that is something else), we can't tell if the process is done.  So Before or After don't help if the unit doesn't have any ExecStartPost. Luckily ExecStartPost is in the flanneld.service unit. In the ExecStartPost is the 'write' to the /run/flannel_docker_opts.env file.  Since ExecStartPre and ExecStartPost are run serially and they *must* exit (cleanly) before they are considered done, this is perfect for setting a real After. ExecStartPost must exit cleanly for this unit to be considered done (and After units to be considered).

In other words, it is my opinion that After (or Before) don't help much if the unit they depend on only has an ExecStart because the act of launching a process doesn't mean that it has done anything yet (it could take a while for a process to do what you need it to before another dependency starts).

Sorry, rant over :-)

-g

Kevin Nguyen

unread,
Jul 29, 2015, 6:32:48 PM7/29/15
to CoreOS User, bro...@gmail.com, lgfa...@gmail.com
Awesome, thanks for taking a look Greg. I seem to have it working consistently in vagrant. :)
Reply all
Reply to author
Forward
0 new messages