Kubernetes constantly restarting Composer Worker pods


Ben Vogan

Aug 31, 2018, 2:50:10 PM
to cloud-composer-discuss
We are unable to use our Composer environment at the moment because Kubernetes is constantly restarting the worker pods.  We are also seeing the scheduler getting stuck in a crash loop and the workload being suspended.  This happens even when there are no DAGs executing on the cluster.

C:\Users\Ben\AppData\Local\Google\Cloud SDK>kubectl get pods
NAME                                 READY     STATUS    RESTARTS   AGE
airflow-redis-0                      1/1       Running   8          16d
airflow-scheduler-f4d6476b5-qrjz8    2/2       Running   111        11h
airflow-sqlproxy-9f9488c95-fnpjd     1/1       Running   8          16d
airflow-webserver-6679f6447f-s742s   2/2       Running   16         15h
airflow-worker-549d57f499-2pjzw      2/2       Running   1276       16d
airflow-worker-549d57f499-bhk25      2/2       Running   1359       16d
airflow-worker-549d57f499-cp47x      2/2       Running   18         11h
airflow-worker-549d57f499-sznwf      2/2       Running   18         13h

Looking at the Stackdriver logs for GKE, I see entries for the container being created and started, then a bunch of these:

{
 insertId:  "1ulsnvmfdjacg5"  
 jsonPayload: {
  apiVersion:  "v1"   
  involvedObject: {
   apiVersion:  "v1"    
   fieldPath:  "spec.containers{airflow-worker}"    
   kind:  "Pod"    
   name:  "airflow-worker-549d57f499-2pjzw"    
   namespace:  "default"    
   resourceVersion:  "6297657"    
   uid:  "c07a3081-a013-11e8-9038-42010a800056"    
  }
  kind:  "Event"   
  message:  "Liveness probe failed: Traceback (most recent call last):
  File "/var/local/worker_checker.py", line 23, in <module>
    main()
  File "/var/local/worker_checker.py", line 17, in main
    checker_lib.host_name, task_counts))
Exception: Worker airflow-worker-549d57f499-2pjzw seems to be dead. Task counts details:{'scheduled': 68L, 'recently_done': 0L, 'running': 0L, 'queued': 26L}
"   
  metadata: {
   creationTimestamp:  "2018-08-31T10:46:55Z"    
   name:  "airflow-worker-549d57f499-2pjzw.154ff24e40f206de"    
   namespace:  "default"    
   resourceVersion:  "93807"    
   selfLink:  "/api/v1/namespaces/default/events/airflow-worker-549d57f499-2pjzw.154ff24e40f206de"    
   uid:  "2dc156d4-ad0b-11e8-9038-42010a800056"    
  }
  reason:  "Unhealthy"   
  source: {
   component:  "kubelet"    
   host:  "gke-us-central1-sk-cloud-default-pool-0629bc4f-kjd1"    
  }
  type:  "Warning"   
 }
 logName:  "projects/sk-data-platform/logs/events"  
 receiveTimestamp:  "2018-08-31T10:59:00.683337131Z"  
 resource: {
  labels: {
   cluster_name:  "us-central1-sk-cloud-compos-fd6c07e8-gke"    
   location:  ""    
   project_id:  "sk-data-platform"    
  }
  type:  "gke_cluster"   
 }
 severity:  "WARNING"  
 timestamp:  "2018-08-31T10:58:55Z"  
}

Followed by:
{
 insertId:  "3lz1pxf9os8cu"  
 jsonPayload: {
  apiVersion:  "v1"   
  involvedObject: {
   apiVersion:  "v1"    
   fieldPath:  "spec.containers{airflow-worker}"    
   kind:  "Pod"    
   name:  "airflow-worker-549d57f499-2pjzw"    
   namespace:  "default"    
   resourceVersion:  "6297657"    
   uid:  "c07a3081-a013-11e8-9038-42010a800056"    
  }
  kind:  "Event"   
  message:  "Liveness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded"   
  metadata: {
   creationTimestamp:  "2018-08-31T12:06:57Z"    
   name:  "airflow-worker-549d57f499-2pjzw.154fdbd2f51fddc0"    
   namespace:  "default"    
   resourceVersion:  "94180"    
   selfLink:  "/api/v1/namespaces/default/events/airflow-worker-549d57f499-2pjzw.154fdbd2f51fddc0"    
   uid:  "5b7f584b-ad16-11e8-9038-42010a800056"    
  }
  reason:  "Unhealthy"   
  source: {
   component:  "kubelet"    
   host:  "gke-us-central1-sk-cloud-default-pool-0629bc4f-kjd1"    
  }
  type:  "Warning"   
 }
 logName:  "projects/sk-data-platform/logs/events"  
 receiveTimestamp:  "2018-08-31T14:06:31.233905357Z"  
 resource: {
  labels: {
   cluster_name:  "us-central1-sk-cloud-compos-fd6c07e8-gke"    
   location:  ""    
   project_id:  "sk-data-platform"    
  }
  type:  "gke_cluster"   
 }
 severity:  "WARNING"  
 timestamp:  "2018-08-31T13:40:57Z"  
}

Followed by:
{
 insertId:  "8byyzyfab60yl"  
 jsonPayload: {
  apiVersion:  "v1"   
  involvedObject: {
   apiVersion:  "v1"    
   fieldPath:  "spec.containers{airflow-worker}"    
   kind:  "Pod"    
   name:  "airflow-worker-549d57f499-2pjzw"    
   namespace:  "default"    
   resourceVersion:  "6297657"    
   uid:  "c07a3081-a013-11e8-9038-42010a800056"    
  }
  kind:  "Event"   
  message:  "Liveness probe failed: No handlers could be found for logger "airflow.logging_config"
Traceback (most recent call last):
  File "/var/local/worker_checker.py", line 10, in <module>
    import checker_lib
  File "/var/local/checker_lib.py", line 6, in <module>
    from airflow import models
  File "/usr/local/lib/python2.7/site-packages/airflow/__init__.py", line 31, in <module>
    from airflow import settings
  File "/usr/local/lib/python2.7/site-packages/airflow/settings.py", line 148, in <module>
    configure_logging()
  File "/usr/local/lib/python2.7/site-packages/airflow/logging_config.py", line 75, in configure_logging
    raise e
ValueError: Unable to configure handler 'file.processor': Project was not passed and could not be determined from the environment.
"   
  metadata: {
   creationTimestamp:  "2018-08-31T15:15:07Z"    
   name:  "airflow-worker-549d57f499-2pjzw.154fde634baf8d5c"    
   namespace:  "default"    
   resourceVersion:  "94410"    
   selfLink:  "/api/v1/namespaces/default/events/airflow-worker-549d57f499-2pjzw.154fde634baf8d5c"    
   uid:  "a562b83b-ad30-11e8-9038-42010a800056"    
  }
  reason:  "Unhealthy"   
  source: {
   component:  "kubelet"    
   host:  "gke-us-central1-sk-cloud-default-pool-0629bc4f-kjd1"    
  }
  type:  "Warning"   
 }
 logName:  "projects/sk-data-platform/logs/events"  
 receiveTimestamp:  "2018-08-31T15:15:13.183327360Z"  
 resource: {
  labels: {
   cluster_name:  "us-central1-sk-cloud-compos-fd6c07e8-gke"    
   location:  ""    
   project_id:  "sk-data-platform"    
  }
  type:  "gke_cluster"   
 }
 severity:  "WARNING"  
 timestamp:  "2018-08-31T15:15:07Z"  
}

And finally:
{
 insertId:  "ftp6urfb700zv"  
 jsonPayload: {
  apiVersion:  "v1"   
  involvedObject: {
   apiVersion:  "v1"    
   fieldPath:  "spec.containers{airflow-worker}"    
   kind:  "Pod"    
   name:  "airflow-worker-549d57f499-2pjzw"    
   namespace:  "default"    
   resourceVersion:  "6297657"    
   uid:  "c07a3081-a013-11e8-9038-42010a800056"    
  }
  kind:  "Event"   
  message:  "Killing container with id docker://airflow-worker:Container failed liveness probe.. Container will be killed and recreated."   
  metadata: {
   creationTimestamp:  "2018-08-31T16:13:26Z"    
   name:  "airflow-worker-549d57f499-2pjzw.154fd53168dc69c2"    
   namespace:  "default"    
   resourceVersion:  "94622"    
   selfLink:  "/api/v1/namespaces/default/events/airflow-worker-549d57f499-2pjzw.154fd53168dc69c2"    
   uid:  "ca8488fc-ad38-11e8-9038-42010a800056"    
  }
  reason:  "Killing"   
  source: {
   component:  "kubelet"    
   host:  "gke-us-central1-sk-cloud-default-pool-0629bc4f-kjd1"    
  }
  type:  "Normal"   
 }
 logName:  "projects/sk-data-platform/logs/events"  
 receiveTimestamp:  "2018-08-31T16:13:31.236505156Z"  
 resource: {
  labels: {
   cluster_name:  "us-central1-sk-cloud-compos-fd6c07e8-gke"    
   location:  ""    
   project_id:  "sk-data-platform"    
  }
  type:  "gke_cluster"   
 }
 severity:  "INFO"  
 timestamp:  "2018-08-31T16:13:26Z"  
}

Any suggestions on what we can do here?

Thanks,
--Ben

Feng Lu

Sep 1, 2018, 12:21:33 AM
to Ben Vogan, cloud-composer-discuss
Which Composer version are you using? 


Ben Vogan

Sep 1, 2018, 10:05:01 PM
to Feng Lu, cloud-composer-discuss
We are using composer-1.1.0-airflow-1.9.0.  Things have gotten much worse.  All of our worker pods were just in a CrashLoopBackOff state (previously we had seen this only with the scheduler).  I just deleted them and they came back up, but our environment is totally unreliable right now.

Help would be greatly appreciated.  I don't know where to look to debug this.

Thanks,
--Ben


Cheng Liu

Sep 1, 2018, 11:33:06 PM
to b...@shopkick.com, fen...@google.com, cloud-compo...@googlegroups.com
Hi Ben,

Are you able to get any logs from the Kubernetes container directly? You can try the following:

Once you find the corresponding pod name from
kubectl get pods
try
kubectl logs {pod name} -c {container name}
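
Since the container is being killed by its liveness probe, the logs of the previous (crashed) instance and the recent pod events are often more telling than the current instance's logs. A couple of generic kubectl commands that may help (the pod name below is just the one from your earlier output, adjust as needed):

# logs from the container instance that was killed, rather than the current one
kubectl logs airflow-worker-549d57f499-2pjzw -c airflow-worker --previous

# recent events for the pod, including liveness probe failures and restart reasons
kubectl describe pod airflow-worker-549d57f499-2pjzw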

To unsubscribe from this group and stop receiving emails from it, send an email to cloud-composer-di...@googlegroups.com.
To post to this group, send email to cloud-compo...@googlegroups.com.



--
BENJAMIN VOGAN Director of Architecture

--
You received this message because you are subscribed to the Google Groups "cloud-composer-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cloud-composer-di...@googlegroups.com.
To post to this group, send email to cloud-compo...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cloud-composer-discuss/CAAoNsdmAyZftzfmSh4brxrQAzC%2BSirZnsFwhyyNVYWR3B2K_rA%40mail.gmail.com.

Feng Lu

Sep 2, 2018, 1:02:49 AM
to Ben Vogan, cloud-composer-discuss
Hi Ben,

Are you sure this is a composer-1.1.0 environment? I didn't see any composer-fluentd pods in your kubectl get pods output.

Feng 


Ben Vogan

Sep 2, 2018, 10:18:30 AM
to Feng Lu, cloud-composer-discuss
Yes, it is a composer-1.1.0 environment - I just didn't include the fluentd pods in the email as they are running fine.

C:\Users\Ben\AppData\Local\Google\Cloud SDK>kubectl get pods
NAME                                 READY     STATUS             RESTARTS   AGE
airflow-redis-0                      1/1       Running            0          1d
airflow-sqlproxy-7d576b7d8f-t8g8x    1/1       Running            0          1d
airflow-webserver-67b88557f8-mfxwf   2/2       Running            0          1d
airflow-worker-7f47d7fdff-cq6zb      1/2       CrashLoopBackOff   11         1h
airflow-worker-7f47d7fdff-jpgzk      1/2       CrashLoopBackOff   9          1h
airflow-worker-7f47d7fdff-mbd2d      1/2       CrashLoopBackOff   9          1h
airflow-worker-7f47d7fdff-qmj9m      1/2       CrashLoopBackOff   13         1h
airflow-worker-7f47d7fdff-qqw84      2/2       Running            12         1h
airflow-worker-7f47d7fdff-xvrq8      1/2       CrashLoopBackOff   15         1h
airflow-worker-7f47d7fdff-zpf6c      1/2       CrashLoopBackOff   9          46m
composer-fluentd-daemon-dzh7s        1/1       Running            0          1d
composer-fluentd-daemon-f7js7        1/1       Running            0          1d
composer-fluentd-daemon-kqsxg        1/1       Running            0          1d
composer-fluentd-daemon-ltjgq        1/1       Running            0          1d
composer-fluentd-daemon-lz68w        1/1       Running            0          1d
composer-fluentd-daemon-mljnw        1/1       Running            0          1d
composer-fluentd-daemon-z54sz        1/1       Running            0          1d


Trying to pull logs from a container in CrashLoopBackOff, I get the following:

C:\Users\Ben\AppData\Local\Google\Cloud SDK>kubectl logs airflow-worker-7f47d7fdff-cq6zb -c airflow-worker
Using mount point: /home/airflow/gcsfuse
Opening GCS connection...
Opening bucket...
Mounting file system...
File system has been successfully mounted.
Fetching cluster endpoint and auth data.
kubeconfig entry generated for us-central1-sk-cloud-compos-f7bec760-gke.
[2018-09-02 14:08:56,077] {__init__.py:45} INFO - Using executor CeleryExecutor
[2018-09-02 14:08:57,453] {__init__.py:45} INFO - Using executor CeleryExecutor
Starting flask
[2018-09-02 14:08:57,514] {_internal.py:88} INFO -  * Running on http://0.0.0.0:8793/ (Press CTRL+C to quit)

worker: Warm shutdown (MainProcess)

 -------------- celery@airflow-worker-7f47d7fdff-cq6zb v4.2.1 (windowlicker)
---- **** -----
--- * ***  * -- Linux-4.4.111+-x86_64-with-debian-8.9 2018-09-02 14:08:56
-- * - **** ---
- ** ---------- [config]
- ** ---------- .> app:         airflow.executors.celery_executor:0x7fa02c841710
- ** ---------- .> transport:   redis://airflow-redis-service:6379/0
- ** ---------- .> results:     redis://airflow-redis-service:6379/0
- *** --- * --- .> concurrency: 6 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
 -------------- [queues]
                .> default          exchange=default(direct) key=default

Thanks for the help.

--Ben



Ben Vogan

Sep 3, 2018, 12:42:48 PM
to Feng Lu, cloud-composer-discuss
I came in this morning and, to make matters worse, there is no scheduler pod anymore:

C:\Users\Ben\AppData\Local\Google\Cloud SDK>kubectl get pods
NAME                                 READY     STATUS             RESTARTS   AGE
airflow-redis-0                      1/1       Running            0          3d
airflow-sqlproxy-7d576b7d8f-t8g8x    1/1       Running            0          3d
airflow-webserver-67b88557f8-mfxwf   2/2       Running            0          2d
airflow-worker-7f47d7fdff-7w6dt      2/2       Running            0          3m
airflow-worker-7f47d7fdff-9phzr      2/2       Running            1          6m
airflow-worker-7f47d7fdff-cq6zb      1/2       CrashLoopBackOff   255        1d
airflow-worker-7f47d7fdff-jpgzk      1/2       CrashLoopBackOff   253        1d
airflow-worker-7f47d7fdff-mbd2d      1/2       CrashLoopBackOff   253        1d
airflow-worker-7f47d7fdff-xvrq8      1/2       CrashLoopBackOff   259        1d
airflow-worker-7f47d7fdff-zpf6c      1/2       CrashLoopBackOff   253        1d
composer-fluentd-daemon-dzh7s        1/1       Running            0          2d
composer-fluentd-daemon-kqsxg        1/1       Running            0          3d
composer-fluentd-daemon-ltjgq        1/1       Running            0          3d
composer-fluentd-daemon-lz68w        1/1       Running            0          2d
composer-fluentd-daemon-mljnw        1/1       Running            0          3d
composer-fluentd-daemon-z54sz        1/1       Running            0          2d

I really need some help here - this system is in production (as Composer was supposedly GA).  Any ideas?

Thanks,
--Ben


Ben Vogan

Sep 3, 2018, 3:33:35 PM
to Feng Lu, cloud-composer-discuss
We spun up yet another new cluster, and right after putting in our DAG definitions the scheduler pod started crashing and went into a CrashLoopBackOff state.  Once again there are no errors at all in the scheduler logs. The only errors I can find in the Kubernetes cluster logs are these:

{
 insertId:  "e5rkgkf3kf4aw"  
 labels: {
  cluster_version:  "1.9.7-gke.6"   
 }
 logName:  "projects/sk-data-platform/logs/cloudaudit.googleapis.com%2Factivity"  
 operation: {
  first:  true   
  id:  "e9e2ddc1-017b-49aa-aacb-5ce3f70c043b"   
  producer:  "k8s.io"   
 }
 protoPayload: {
  authenticationInfo: {
   principalEmail:  "<redacted>"    
  }
  authorizationInfo: [
   0: {
    granted:  true     
    permission:  "io.k8s.core.v1.pods.exec.create"     
    resource:  "core/v1/namespaces/default/pods/airflow-worker-f89d5c658-89bjj/exec/airflow-worker-f89d5c658-89bjj"     
    resourceAttributes: {
    }
   }
  ]
  methodName:  "io.k8s.core.v1.pods.exec.create"   
  requestMetadata: {
   callerIp:  "64.137.138.40"    
   destinationAttributes: {
   }
   requestAttributes: {
   }
  }
  resourceName:  "core/v1/namespaces/default/pods/airflow-worker-f89d5c658-89bjj/exec/airflow-worker-f89d5c658-89bjj"   
  serviceName:  "k8s.io"   
  status: {
   code:  2    
   message:  "UNKNOWN"    
  }
 }
 receiveTimestamp:  "2018-09-03T18:53:07.096888743Z"  
 resource: {
  labels: {
   cluster_name:  "us-central1-sk-cloud-compos-a63437c0-gke"    
   location:  "us-central1-c"    
   project_id:  "sk-data-platform"    
  }
  type:  "k8s_cluster"   
 }
 severity:  "ERROR"  
 timestamp:  "2018-09-03T18:52:49.872286Z"  
}

And

{
 insertId:  "qx04mee7h07h"  
 labels: {
  cluster_version:  "1.9.7-gke.6"   
 }
 logName:  "projects/sk-data-platform/logs/cloudaudit.googleapis.com%2Factivity"  
 operation: {
  id:  "b72a4894-7693-426a-8103-c1ae4c4dbcd5"   
  producer:  "k8s.io"   
 }
 protoPayload: {
  authenticationInfo: {
   principalEmail:  "system:serviceaccount:kube-system:certificate-controller"    
  }
  authorizationInfo: [
   0: {
    permission:  "io.k8s.certificates.v1beta1.certificatesigningrequests.delete"     
   }
  ]
  methodName:  "io.k8s.certificates.v1beta1.certificatesigningrequests.delete"   
  requestMetadata: {
   callerIp:  "::1"    
  }
  response: {
   @type:  "core.k8s.io/v1.Status"    
   apiVersion:  "v1"    
   code:  403    
   details: {
    group:  "certificates.k8s.io"     
    kind:  "certificatesigningrequests"     
    name:  "node-csr-kprvNC-LgiPn_HsqbWknxmhjCvMAvvq-nhUFSp88CmY"     
   }
   kind:  "Status"    
   message:  "certificatesigningrequests.certificates.k8s.io "node-csr-kprvNC-LgiPn_HsqbWknxmhjCvMAvvq-nhUFSp88CmY" is forbidden: User "system:serviceaccount:kube-system:certificate-controller" cannot delete certificatesigningrequests.certificates.k8s.io at the cluster scope: Unknown user "system:serviceaccount:kube-system:certificate-controller""    
   metadata: {
   }
   reason:  "Forbidden"    
   status:  "Failure"    
  }
  serviceName:  "k8s.io"   
  status: {
   code:  7    
   message:  "certificatesigningrequests.certificates.k8s.io "node-csr-kprvNC-LgiPn_HsqbWknxmhjCvMAvvq-nhUFSp88CmY" is forbidden: User "system:serviceaccount:kube-system:certificate-controller" cannot delete certificatesigningrequests.certificates.k8s.io at the cluster scope: Unknown user "system:serviceaccount:kube-system:certificate-controller""    
  }
 }
 receiveTimestamp:  "2018-09-03T18:56:21.492555458Z"  
 resource: {
  labels: {
   cluster_name:  "us-central1-sk-cloud-compos-a63437c0-gke"    
   location:  "us-central1-c"    
   project_id:  "sk-data-platform"    
  }
  type:  "k8s_cluster"   
 }
 severity:  "ERROR"  
 timestamp:  "2018-09-03T18:56:10.194429Z"  
}

--Ben

Cheng Liu

Sep 3, 2018, 3:41:31 PM
to Ben Vogan, Feng Lu, cloud-composer-discuss
I have experienced cases where the DAG crashes as soon as I import certain libraries (like google-cloud). Could they be related?


Ben Vogan

Sep 3, 2018, 6:18:02 PM
to Cheng Liu, Feng Lu, cloud-composer-discuss
The strange thing is that the airflow logs show no errors.  The errors I have included in this thread are the only ones I can find anywhere.  Our PyPI packages are just:

google-cloud   >=0.28.0
google-cloud-bigquery  ==0.28.0
scikit-learn  ==0.19.1
scipy  ==0.19.0

We do of course import things from google.cloud (bigquery for example), but really we aren't doing anything particularly complicated at the moment.  Mostly scheduling jobs to run on BigQuery.


--Ben




Cheng Liu

Sep 3, 2018, 6:32:09 PM
to cloud-composer-discuss
google-cloud was the one that gave me the trouble.

https://groups.google.com/forum/#!topic/cloud-composer-discuss/0QQp5aZHDSo

Ben Vogan

Sep 4, 2018, 12:00:09 PM
to Cheng Liu, Feng Lu, cloud-composer-discuss
I think we have found the issue - at least the part causing the scheduler to restart (not sure about the worker restarts yet).  Looking at scheduler_checker.py line 58, we see that the liveness check does not actually talk to the scheduler container - it checks Stackdriver Logging using the following command:

gcloud beta logging read 'resource.type="container" AND resource.labels.cluster_name="us-central1-sk-cloud-compos-fd6c07e8-gke" AND resource.labels.namespace_id="default" AND logName="projects/sk-data-platform/logs/airflow-scheduler"' --limit 1 --format json --project sk-data-platform

We had turned off Stackdriver logging because the cost is so exorbitant. Since turning the logging back on, we have not seen any further scheduler restarts.

It seems to me a poor choice for a liveness check to depend on logs.
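
For anyone hitting the same symptom, a quick way to sanity-check this on your own environment is to look at what the scheduler's liveness probe actually runs and whether the log stream it reads has any recent entries. This is only a sketch - the deployment name is assumed from the pod names earlier in this thread, and the project is the one from the command above:

# show the scheduler deployment; look for the livenessProbe section to see the exec command
# (deployment name assumed to be airflow-scheduler, matching the pod names above)
kubectl get deployment airflow-scheduler -o yaml

# check whether the airflow-scheduler log stream has recent entries,
# which is essentially what the checker script queries
gcloud beta logging read 'logName="projects/sk-data-platform/logs/airflow-scheduler"' --limit 1 --format json --project sk-data-platform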

--Ben


Feng Lu

Sep 4, 2018, 12:34:32 PM
to Ben Vogan, Cheng Liu, cloud-composer-discuss
Great! 
I agree that relying on Stackdriver logs for health checking [1] is not an ideal solution; we are working on the following ways to improve that experience:
- improving the Airflow scheduler so that it no longer freezes
- other means of detecting scheduler faults.



Irvi Firqotul Aini

Oct 20, 2018, 9:38:27 AM
to cloud-composer-discuss
Hi, I also have the same problem. Checking both the pods and the Stackdriver logs, nothing seems to have gone wrong. I haven't installed any new plugins yet, and this has been happening over and over since this morning (GMT+7). Do you have any advice?

Best,
Irvi

Feng Lu

Oct 22, 2018, 11:59:35 PM
to Irvi Firqotul Aini, cloud-compo...@googlegroups.com
Could you share GKE events related to the restarts? 

Step 1: Go to the GKE Workloads page.

Step 2: Click into the pod details and check under the "Events" and "Logs" tabs.
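
If the console isn't handy, the same events should also be visible from plain kubectl; a minimal sketch (the pod name is a placeholder):

# recent events in the namespace, most recent last (look for Unhealthy / Killing entries)
kubectl get events --sort-by=.metadata.creationTimestamp

# per-pod status, restart counts and probe events
kubectl describe pod <airflow-worker-pod-name>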
Alternatively, feel free to open a support case so you can share with us environment details and debugging artifacts.

Feng 


Irvi Firqotul Aini

Oct 23, 2018, 12:59:19 AM
to cloud-composer-discuss
I already opened a support case, and the support engineer asked me to recreate my environment and wait for the next release, in which this behavior will most likely be patched. Still waiting for the next release though.

Best,
Irvi

Kwasi Ohene-Adu

Feb 15, 2019, 7:05:05 AM
to cloud-composer-discuss
Hi Feng,

Do you by any chance have an ETA for the move away from using Stackdriver logs for scheduler liveness? Just like Ben Vogan, we aggressively use log exclusion filters so we can stay below the Stackdriver free tier quota. At the moment, we're incurring a bit of cost on a monthly basis due to liveness checking on the scheduler.
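
In case it helps as an interim workaround (only a sketch, not something confirmed by Google): since the scheduler liveness check apparently reads only the airflow-scheduler log stream, an exclusion filter that matches everything except that stream should keep ingestion costs down without starving the probe. Something along these lines, adapted to however your exclusions are currently defined:

# exclusion filter sketch: drop container logs from the Composer namespace,
# but keep the airflow-scheduler stream that the liveness checker queries
resource.type="container"
resource.labels.namespace_id="default"
NOT logName:"airflow-scheduler"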

Thanks!
Kwasi
