Docker didn't start after CoreOS upgrade


ale...@chorally.com

Mar 29, 2016, 6:36:08 AM
to CoreOS User
Hello,

I've got a "single node cluster" running on AWS. I'm on the stable channel, and my machine rebooted due to a CoreOS upgrade on 2016-03-26.

After the reboot, docker.service failed to start, and so did all the dependent units.

I had to manually start all of them via systemctl and now everything is up and running.


This machine is managed with Ansible, so it has minimal AWS user-data and a cloud-config.yml uploaded to the user's home directory.
There are other systemd units that are loaded and enabled via Ansible roles.

Full boot log here: http://pastebin.com/tXZHkZCY
I think the relevant part is here:

Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: Failed to start Network fabric for containers.
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd-resolved[831]: System hostname changed to 'ip-172-17-0-21'.
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd-hostnamed[838]: Changed static host name to 'ip-172-17-0-21.service.consul'
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd-hostnamed[838]: Changed host name to 'ip-172-17-0-21.service.consul'
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: Dependency failed for internal Custom Fields Manager Docker Container.
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: production-internal-cfm-web.service: Job production-internal-cfm-web.service/start failed with result 'dependency'.
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: Dependency failed for internal Kerberos Docker Container.
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: vpn.service: Job vpn.service/start failed with result 'dependency'.
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: Dependency failed for bolton Monitor Workers Docker Container.
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: production-bolton-monitor-workers.service: Job production-bolton-monitor-workers.service/start failed with result 'dependency'.
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: Dependency failed for nginx Docker Container.
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: nginx.service: Job nginx.service/start failed with result 'dependency'.
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: Dependency failed for bolton Redis Docker Container.
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: production-bolton-redis.service: Job production-bolton-redis.service/start failed with result 'dependency'.
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: Dependency failed for Docker Application Container Engine.
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: Dependency failed for Consul container.
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: Dependency failed for Registrator container.
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: docker-registrator.service: Job docker-registrator.service/start failed with result 'dependency'.
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: docker-consul.service: Job docker-consul.service/start failed with result 'dependency'.
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: docker.service: Job docker.service/start failed with result 'dependency'.
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: flanneld.service: Unit entered failed state.
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: flanneld.service: Failed with result 'exit-code'.
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: Reloading.
Mar 26 01:30:34 ip-172-17-0-21.service.consul dockerd[760]: time="2016-03-26T01:30:34.451055460Z" level=info msg="Firewalld running: false"
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: libcontainer-760-systemd-test-default-dependencies.scope: Scope has no PIDs. Refusing.
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: Starting etcd2...
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: Starting Garbage Collection for rkt...
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: Starting Generate /run/coreos/motd...
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: Reached target Run Docker containers before main Docker starts up.
Mar 26 01:30:34 ip-172-17-0-21.service.consul coreos-cloudinit[809]: 2016/03/26 01:30:34 Ensuring runtime unit file "etcd2.service" is unmasked
Mar 26 01:30:34 ip-172-17-0-21.service.consul coreos-cloudinit[809]: 2016/03/26 01:30:34 Ensuring runtime unit file "fleet.service" is unmasked
Mar 26 01:30:34 ip-172-17-0-21.service.consul coreos-cloudinit[809]: 2016/03/26 01:30:34 Ensuring runtime unit file "locksmithd.service" is unmasked
Mar 26 01:30:34 ip-172-17-0-21.service.consul coreos-cloudinit[809]: 2016/03/26 01:30:34 Calling unit command "restart" on "oem-cloudinit.service"'
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: Started Garbage Collection for rkt.
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: Started Generate /run/coreos/motd.
Mar 26 01:30:34 ip-172-17-0-21.service.consul dockerd[760]: time="2016-03-26T01:30:34.690490617Z" level=info msg="Loading containers: start."
Mar 26 01:30:34 ip-172-17-0-21.service.consul dockerd[760]: time="2016-03-26T01:30:34.691007939Z" level=info msg="Loading containers: done."
Mar 26 01:30:34 ip-172-17-0-21.service.consul dockerd[760]: time="2016-03-26T01:30:34.691024843Z" level=info msg="Daemon has completed initialization"
Mar 26 01:30:34 ip-172-17-0-21.service.consul dockerd[760]: time="2016-03-26T01:30:34.691074403Z" level=info msg="Docker daemon" commit=9894698 execdriver=native-0.2 graphdriver=overlay version=1.9.1
Mar 26 01:30:34 ip-172-17-0-21.service.consul dockerd[760]: time="2016-03-26T01:30:34.696872089Z" level=info msg="API listen on /var/run/early-docker.sock"
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: Reached target Multi-User System.


$ curl http://169.254.169.254/2009-04-04/user-data

#cloud-config

coreos:
  etcd2:
    #single node cluster setup
    name: core0
    initial-cluster: core0=http://$private_ipv4:2380
    initial-advertise-peer-urls: http://$private_ipv4:2380
    advertise-client-urls: http://$private_ipv4:2379
    # listen on both the official ports and the legacy ports
    # legacy ports can be omitted if your application doesn't depend on them
    listen-client-urls: http://0.0.0.0:2379,http://0.0.0.0:4001
    listen-peer-urls: http://0.0.0.0:2380
    # generate a new token for each unique cluster from https://discovery.etcd.io/new?size=3
    # specify the initial size of your cluster with ?size=X
    #discovery: https://discovery.etcd.io/e42cedcdc7c8df727706971000a973ad
    # multi-region and multi-cloud deployments need to use $public_ipv4
    #advertise-client-urls: http://$private_ipv4:2379,http://$public_ipv4:2379
    #initial-advertise-peer-urls: http://$private_ipv4:2380,http://$public_ipv4:7001
    # listen on both the official ports and the legacy ports
    # legacy ports can be omitted if your application doesn't depend on them
    #listen-client-urls: http://0.0.0.0:2379,http://0.0.0.0:4001
    #listen-peer-urls: http://$private_ipv4:2380,http://$public_ipv4:7001
  update:
    reboot-strategy: "etcd-lock"
  fleet:
    public-ip: "$public_ipv4"
  flannel:
    public-ip: "$public_ipv4"
  units:
    - name: etcd2.service
      command: start
    - name: fleet.service
      command: start
    - name: var-lib-docker.mount
      command: start
      content: |
        [Unit]
        Description=Mount ephemeral0 to /var/lib/docker
        Requires=dev-xvdb.device
        [Mount]
        What=/dev/xvdb
        Where=/var/lib/docker
        Type=ext3
    - name: cache.mount
      command: start
      content: |
        [Unit]
        Description=Mount ephemeral1 to /cache
        Requires=dev-xvdc.device
        [Mount]
        What=/dev/xvdc
        Where=/cache
        Type=ext3
    - name: flanneld.service
      drop-ins:
        - name: 50-network-config.conf
          content: |
            [Service]
            ExecStartPre=/usr/bin/etcdctl set /coreos.com/network/config '{ "Network": "10.17.0.0/16", "Backend": {"Type": "aws-vpc"} }'
      command: start
    - name: docker.service
      command: start
      drop-ins:
        - name: 10-wait-docker.conf
          content: |
            [Unit]
            After=var-lib-docker.mount
            After=cache.mount
            Requires=var-lib-docker.mount
            Requires=cache.mount
            Requires=flanneld.service
            After=flanneld.service
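Worth noting for anyone hitting the same failure: `Requires=flanneld.service` in the drop-in above means systemd fails docker.service outright whenever flanneld fails at boot, which is exactly the cascade in the log. A softer alternative, assuming you can tolerate Docker coming up even when flannel is down, is a `Wants=` ordering; this is only a sketch, not tested on this setup:

```ini
# Hypothetical alternative 10-wait-docker.conf drop-in: keep the
# ordering on flannel, but make the dependency non-fatal.
# Wants= pulls flanneld in without failing docker.service if it fails.
[Unit]
After=var-lib-docker.mount
After=cache.mount
Requires=var-lib-docker.mount
Requires=cache.mount
Wants=flanneld.service
After=flanneld.service
```

The trade-off is that containers may briefly start without the flannel overlay network, so this only makes sense if your units can retry or tolerate that window.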



$ cat /home/core/cloud-config.yml

#cloud-config

hostname: core-0
coreos:
  etcd2:
    #single node cluster setup
    name: core0
    initial-cluster: core0=http://$private_ipv4:2380
    initial-advertise-peer-urls: http://$private_ipv4:2380
    advertise-client-urls: http://$private_ipv4:2379
    # listen on both the official ports and the legacy ports
    # legacy ports can be omitted if your application doesn't depend on them
    listen-client-urls: http://0.0.0.0:2379,http://0.0.0.0:4001
    listen-peer-urls: http://0.0.0.0:2380
  fleet:
    public-ip: "$public_ipv4"
    metadata: "dc=aws-eu-west-1,availability_zone=eu-west-1a"
  flannel:
    public-ip: "$public_ipv4"
  update:
    reboot-strategy: "best-effort"
  units:
    - name: etcd2.service
      command: start
    - name: fleet.service
      command: start
    - name: "docker-consul.service"
      command: "start"
      enable: true
      content: |
        [Unit]
        Description=Consul container
        After=docker.service
        Requires=docker.service

        [Service]
        TimeoutStartSec=0
        ExecStartPre=-/usr/bin/docker kill consul
        ExecStartPre=-/usr/bin/docker rm consul
        ExecStartPre=/usr/bin/docker pull gliderlabs/consul-agent:0.6
        ExecStart=/usr/bin/docker run --rm --name consul \
          -v /data/consul:/data -v /etc/consul.d/:/etc/consul.d:ro \
          -v /var/run/docker.sock:/var/run/docker.sock --net=host \
          -p 8300:8300 -p 8301:8301 -p 8301:8301/udp \
          -p 8400:8400 -p 127.0.0.1:8500:8500 -p 53:8600/udp \
          -e SERVICE_8500_NAME=consul -e SERVICE_TAGS="http,monitoring" \
          -e SERVICE_8300_IGNORE=true -e SERVICE_8301_IGNORE=true \
          -e SERVICE_8400_IGNORE=true -e SERVICE_8600_IGNORE=true \
          gliderlabs/consul-agent:0.6 -ui-dir /ui -config-dir=/etc/consul.d \
          -advertise=$private_ipv4 -node=core-0
        ExecStop=/usr/bin/docker stop consul

        [Install]
        WantedBy=multi-user.target
    - name: "docker-registrator.service"
      command: "start"
      enable: true
      content: |
        [Unit]
        Description=Registrator container
        After=docker-consul.service
        Requires=docker-consul.service

        [Service]
        TimeoutStartSec=0
        ExecStartPre=-/usr/bin/docker kill registrator
        ExecStartPre=-/usr/bin/docker rm registrator
        ExecStartPre=/usr/bin/docker pull gliderlabs/registrator:latest
        ExecStart=/usr/bin/docker run --rm --name registrator \
          -v /var/run/docker.sock:/tmp/docker.sock --net=host \
          gliderlabs/registrator -internal -resync 60 consul://127.0.0.1:8500
        ExecStop=/usr/bin/docker stop registrator

        [Install]
        WantedBy=multi-user.target



Thank you,
Alessio

Kai Hendry

Mar 29, 2016, 9:06:10 AM
to CoreOS User
They probably didn't start because your network wasn't ready, in my limited experience.

My workaround is to put:

RestartSec=5
Restart=always

In your docker service files.
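For example, as a systemd drop-in file; the helper function and the dry-run target directory here are just a sketch, and you'd point it at /etc/systemd/system and run `systemctl daemon-reload` on a real host:

```shell
#!/bin/sh
# Sketch: write a restart drop-in for a unit. The second argument
# overrides the target directory so this can be dry-run without root.
write_restart_dropin() {
  unit="$1"
  dest="${2:-/etc/systemd/system}/${unit}.d"
  mkdir -p "$dest"
  # RestartSec=5 waits 5s between attempts; Restart=always retries
  # regardless of how the service exited (e.g. network not ready yet).
  cat > "$dest/50-restart.conf" <<'EOF'
[Service]
Restart=always
RestartSec=5
EOF
}

# Dry run into a temp dir instead of /etc:
tmp=$(mktemp -d)
write_restart_dropin docker-consul.service "$tmp"
cat "$tmp/docker-consul.service.d/50-restart.conf"
```

After installing the drop-in for real, `systemctl daemon-reload` picks it up.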

It would be nicer if you split your logs by the Docker service that you needed to restart, e.g.

journalctl -u docker-consul -e > /tmp/consul-log.txt

Alessio Caiazza

Mar 29, 2016, 9:17:28 AM
to CoreOS User, Kai Hendry

> On 29 Mar 2016, at 15:06, Kai Hendry <kai.h...@gmail.com> wrote:
>
> They probably didn't start because your network wasn't ready, in my limited experience.
>
> My workaround is to put:
>
> RestartSec=5
> Restart=always
>
> In your docker service files.

Thank you, but it looks like the Docker daemon itself didn't start, and this failed all the Docker containers.
In any case, I will add the restart options as well.

>
> It would be nicer if you split your logs by the Docker service that you needed to restart, e.g.
>
> journalctl -u docker-consul -e > /tmp/consul-log.txt
>

Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: Dependency failed for Consul container.
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: docker-consul.service: Job docker-consul.service/start failed with result 'dependency'.


Here are the docker.service logs

Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: Dependency failed for Docker Application Container Engine.
Mar 26 01:30:34 ip-172-17-0-21.service.consul systemd[1]: docker.service: Job docker.service/start failed with result 'dependency'.
Mar 26 01:30:47 ip-172-17-0-21.service.consul systemd[1]: Started Docker Application Container Engine.
Mar 26 01:30:47 ip-172-17-0-21.service.consul dockerd[1035]: time="2016-03-26T01:30:47.805684127Z" level=warning msg="Usage of loopback devices is strongly discouraged for production use. Please use `--storage-opt dm.thinp
Mar 26 01:30:47 ip-172-17-0-21.service.consul dockerd[1035]: time="2016-03-26T01:30:47.838770820Z" level=info msg="[graphdriver] using prior storage driver \"devicemapper\""
Mar 26 01:30:48 ip-172-17-0-21.service.consul dockerd[1035]: time="2016-03-26T01:30:48.479388161Z" level=info msg="Firewalld running: false"
Mar 26 01:30:48 ip-172-17-0-21.service.consul dockerd[1035]: time="2016-03-26T01:30:48.560204812Z" level=info msg="Loading containers: start."
Mar 26 01:30:48 ip-172-17-0-21.service.consul dockerd[1035]: ......................................
Mar 26 01:30:48 ip-172-17-0-21.service.consul dockerd[1035]: time="2016-03-26T01:30:48.583607618Z" level=info msg="Loading containers: done."
Mar 26 01:30:48 ip-172-17-0-21.service.consul dockerd[1035]: time="2016-03-26T01:30:48.583625220Z" level=info msg="Daemon has completed initialization"
Mar 26 01:30:48 ip-172-17-0-21.service.consul dockerd[1035]: time="2016-03-26T01:30:48.583640579Z" level=info msg="Docker daemon" commit=9894698 execdriver=native-0.2 graphdriver=devicemapper version=1.9.1
Mar 26 01:30:48 ip-172-17-0-21.service.consul dockerd[1035]: time="2016-03-26T01:30:48.588703574Z" level=info msg="API listen on /var/run/docker.sock"
Mar 26 18:20:11 ip-172-17-0-21.service.consul dockerd[1035]: time="2016-03-26T18:20:11.317615278Z" level=info msg="GET /v1.21/containers/json"
Mar 26 18:20:45 ip-172-17-0-21.service.consul dockerd[1035]: time="2016-03-26T18:20:45.732401459Z" level=info msg="POST /v1.21/containers/production-customer-monitor-web/exec"
Mar 26 18:20:45 ip-172-17-0-21.service.consul dockerd[1035]: time="2016-03-26T18:20:45.732737669Z" level=error msg="Error setting up exec command in container production-bolton-monitor-web: notrunning: Container production
Mar 26 18:20:45 ip-172-17-0-21.service.consul dockerd[1035]: time="2016-03-26T18:20:45.732759282Z" level=error msg="Handler for POST /v1.21/containers/production-customer-monitor-web/exec returned error: Container production
Mar 26 18:20:45 ip-172-17-0-21.service.consul dockerd[1035]: time="2016-03-26T18:20:45.732776195Z" level=error msg="HTTP Error" err="Container production-customer-monitor-web is not running" statusCode=500


Alex Crawford

Mar 29, 2016, 5:01:29 PM
to ale...@chorally.com, CoreOS User
On 03/29, ale...@chorally.com wrote:
> Full boot log here: http://pastebin.com/tXZHkZCY

It looks like flannel failed (exited with code: 4).

What does `journalctl -b -u flanneld --no-pager` show?

-Alex

Alessio Caiazza

Mar 30, 2016, 3:34:12 AM
to Alex Crawford, CoreOS User
It looks like it didn't log anything for the first 2 days...

# journalctl -b -u flanneld --no-pager
-- Logs begin at Mon 2016-03-28 03:38:40 UTC, end at Wed 2016-03-30 07:31:19 UTC. --
Mar 28 22:30:47 ip-172-17-0-21.service.consul sdnotify-proxy[961]: I0328 22:30:47.489524 00001 renew.go:41] Lease renewed, new expiration: 2016-03-29 22:30:47.488984671 +0000 UTC
Mar 29 21:30:47 core-0 sdnotify-proxy[961]: I0329 21:30:47.490263 00001 renew.go:41] Lease renewed, new expiration: 2016-03-30 21:30:47.489554107 +0000 UTC


Alessio