kubernetes pod operator throwing error when task succeeds


Anthony Brown

Aug 30, 2018, 8:19:10 AM
to cloud-compo...@googlegroups.com
Hi,

I am using the Kubernetes pod operator to run a container on the Kubernetes cluster. Occasionally the Airflow DAG fails even though the container I want to run succeeds. Most of the time everything works fine.

When it fails, the container that I am running finishes and is removed, but the airflow-xcom-sidecar container keeps running. These leftover containers use up resources on the cluster until eventually no more pods can be launched.

This message appears in the Airflow logs:

[2018-08-20 09:50:47,885] {base_task_runner.py:98} INFO - Subtask: Traceback (most recent call last):
[2018-08-20 09:50:47,886] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/bin/airflow", line 27, in <module>
[2018-08-20 09:50:47,892] {base_task_runner.py:98} INFO - Subtask:     args.func(args)
[2018-08-20 09:50:47,896] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/airflow/bin/cli.py", line 392, in run
[2018-08-20 09:50:47,899] {base_task_runner.py:98} INFO - Subtask:     pool=args.pool,
[2018-08-20 09:50:47,900] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/airflow/utils/db.py", line 50, in wrapper
[2018-08-20 09:50:47,903] {base_task_runner.py:98} INFO - Subtask:     result = func(*args, **kwargs)
[2018-08-20 09:50:47,904] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/airflow/models.py", line 1492, in _run_raw_task
[2018-08-20 09:50:47,906] {base_task_runner.py:98} INFO - Subtask:     result = task_copy.execute(context=context)
[2018-08-20 09:50:47,907] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/airflow/contrib/operators/kubernetes_pod_operator.py", line 115, in execute
[2018-08-20 09:50:47,909] {base_task_runner.py:98} INFO - Subtask:     get_logs=self.get_logs)
[2018-08-20 09:50:47,909] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/airflow/contrib/kubernetes/pod_launcher.py", line 81, in run_pod
[2018-08-20 09:50:47,910] {base_task_runner.py:98} INFO - Subtask:     return self._monitor_pod(pod, get_logs)
[2018-08-20 09:50:47,912] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/airflow/contrib/kubernetes/pod_launcher.py", line 98, in _monitor_pod
[2018-08-20 09:50:47,912] {base_task_runner.py:98} INFO - Subtask:     while self.base_container_is_running(pod):
[2018-08-20 09:50:47,913] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/airflow/contrib/kubernetes/pod_launcher.py", line 125, in base_container_is_running
[2018-08-20 09:50:47,914] {base_task_runner.py:98} INFO - Subtask:     event = self.read_pod(pod)
[2018-08-20 09:50:47,914] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/airflow/contrib/kubernetes/pod_launcher.py", line 132, in read_pod
[2018-08-20 09:50:47,916] {base_task_runner.py:98} INFO - Subtask:     return self._client.read_namespaced_pod(pod.name, pod.namespace)
[2018-08-20 09:50:47,916] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 18163, in read_namespaced_pod
[2018-08-20 09:50:48,008] {base_task_runner.py:98} INFO - Subtask:     (data) = self.read_namespaced_pod_with_http_info(name, namespace, **kwargs)
[2018-08-20 09:50:48,012] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 18254, in read_namespaced_pod_with_http_info
[2018-08-20 09:50:48,018] {base_task_runner.py:98} INFO - Subtask:     collection_formats=collection_formats)
[2018-08-20 09:50:48,021] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 321, in call_api
[2018-08-20 09:50:48,052] {base_task_runner.py:98} INFO - Subtask:     _return_http_data_only, collection_formats, _preload_content, _request_timeout)
[2018-08-20 09:50:48,053] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 155, in __call_api
[2018-08-20 09:50:48,055] {base_task_runner.py:98} INFO - Subtask:     _request_timeout=_request_timeout)
[2018-08-20 09:50:48,055] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 342, in request
[2018-08-20 09:50:48,057] {base_task_runner.py:98} INFO - Subtask:     headers=headers)
[2018-08-20 09:50:48,058] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/kubernetes/client/rest.py", line 231, in GET
[2018-08-20 09:50:48,094] {base_task_runner.py:98} INFO - Subtask:     query_params=query_params)
[2018-08-20 09:50:48,099] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/kubernetes/client/rest.py", line 222, in request
[2018-08-20 09:50:48,100] {base_task_runner.py:98} INFO - Subtask:     raise ApiException(http_resp=r)
[2018-08-20 09:50:48,102] {base_task_runner.py:98} INFO - Subtask: kubernetes.client.rest.ApiException: (401)
[2018-08-20 09:50:48,103] {base_task_runner.py:98} INFO - Subtask: Reason: Unauthorized
[2018-08-20 09:50:48,103] {base_task_runner.py:98} INFO - Subtask: HTTP response headers: HTTPHeaderDict({'Date': 'Mon, 20 Aug 2018 09:50:47 GMT', 'Audit-Id': '7ad84124-23c1-420b-818c-6c6a7e5c1d7a', 'Content-Length': '129', 'Content-Type': 'application/json', 'Www-Authenticate': 'Basic realm="kubernetes-master"'})
[2018-08-20 09:50:48,107] {base_task_runner.py:98} INFO - Subtask: HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
[2018-08-20 09:50:48,107] {base_task_runner.py:98} INFO - Subtask: 
[2018-08-20 09:50:48,109] {base_task_runner.py:98} INFO - Subtask: 
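
The traceback above ends in an `ApiException` with status 401, raised while Airflow polls the pod status rather than by the container itself. As a rough sketch for spotting this failure mode in a task log, the snippet below pulls the status code out of the `HTTP response body` line. The function name `extract_api_status` is only illustrative, and the log format is assumed to match the lines shown here.

```python
import json
import re

# Matches the JSON payload that follows "HTTP response body:" in the task log.
BODY_PATTERN = re.compile(r"HTTP response body:\s*(\{.*\})")


def extract_api_status(log_line):
    """Return the HTTP status code from a logged Kubernetes API error, or None."""
    match = BODY_PATTERN.search(log_line)
    if match is None:
        return None
    try:
        body = json.loads(match.group(1))
    except ValueError:
        # The payload was truncated or is not valid JSON.
        return None
    return body.get("code")


# Sample line copied from the log above.
sample = (
    '[2018-08-20 09:50:48,107] {base_task_runner.py:98} INFO - Subtask: '
    'HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},'
    '"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}'
)
print(extract_api_status(sample))
```

A 401 here points at the credentials Airflow uses to talk to the Kubernetes API, not at the workload in the pod.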

I am currently running the composer-1.0.0 release. I can't see anything in the 1.1.0 release that would fix this, but I am willing to try running on it if anybody thinks it may help.

Thanks

--
-- 

Anthony Brown
Data Engineer BI Team - John Lewis
Tel : 0787 215 7305

**********************************************************************
This email is confidential and may contain copyright material of the John Lewis Partnership.
If you are not the intended recipient, please notify us immediately and delete all copies of this message.
(Please note that it is your responsibility to scan this message for viruses). Email to and from the
John Lewis Partnership is automatically monitored for operational and lawful business reasons.
**********************************************************************

John Lewis plc
Registered in England 233462
Registered office 171 Victoria Street London SW1E 5NN

Websites: https://www.johnlewis.com
http://www.waitrose.com
https://www.johnlewisfinance.com
http://www.johnlewispartnership.co.uk

**********************************************************************

Cameron Moberg

Aug 30, 2018, 9:08:17 AM
to Anthony Brown, cloud-composer-discuss
Hi Anthony,

About how long does the pod you are launching run for?

--
You received this message because you are subscribed to the Google Groups "cloud-composer-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cloud-composer-di...@googlegroups.com.
To post to this group, send email to cloud-compo...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cloud-composer-discuss/CAHRXA7ESvowvh5EVGVUFP6VOC1YcCgex5Wg4_%3D-Y62YrZY0UXg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Anthony Brown

Aug 30, 2018, 9:57:31 AM
to cjmo...@google.com, cloud-compo...@googlegroups.com
It runs for about 9 minutes. Other runs that take as long (or longer) work fine.



aurelie...@luxola.com

Sep 17, 2018, 4:51:46 AM
to cloud-composer-discuss
Hello, I am running into the exact same issue.
Did you manage to find a solution?

Anthony Brown

Sep 17, 2018, 10:22:05 AM
to cloud-compo...@googlegroups.com
I have not found a solution and am still getting the issue.



aurelie...@luxola.com

Sep 18, 2018, 5:27:15 AM
to cloud-composer-discuss
Okay, I scaled up the node pool in which the pod was running and the error no longer shows up.
Maybe the pod was simply running out of resources, although I am not sure that would explain why the task succeeds anyway.

Anthony Brown

Oct 5, 2018, 10:36:13 AM
to cloud-compo...@googlegroups.com
I scaled up the node pool and it helped in that the error did not happen as often, but we still kept getting these errors on long-running pods (over about 15 minutes).
I think it may be related to https://github.com/kubernetes-client/python-base/issues/59, where the tokens used are not refreshed properly. There is a suggested workaround in there, but somebody is also working on a proper fix, so it is probably not worth trying to get the workaround into Airflow.
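
To make the suspected failure mode concrete: if the client keeps using a token that has expired, every later API call is rejected with a 401 even though the pod itself is healthy. The sketch below shows the general refresh-and-retry pattern. It is neither the workaround from the linked issue nor the eventual fix; `UnauthorizedError`, `call_with_refresh`, and `FakeClient` are hypothetical stand-ins so the example runs without a cluster.

```python
class UnauthorizedError(Exception):
    """Hypothetical stand-in for an API error carrying HTTP status 401."""
    status = 401


def call_with_refresh(api_call, refresh_credentials, max_refreshes=1):
    """Run api_call; on a 401, refresh credentials and try again."""
    refreshes = 0
    while True:
        try:
            return api_call()
        except UnauthorizedError:
            if refreshes >= max_refreshes:
                raise  # Still unauthorized after refreshing: give up.
            refresh_credentials()
            refreshes += 1


class FakeClient:
    """Simulates a client whose token stays expired until it is refreshed."""

    def __init__(self):
        self.token_valid = False

    def refresh(self):
        self.token_valid = True

    def read_pod(self):
        if not self.token_valid:
            raise UnauthorizedError("Unauthorized")
        return "Running"


client = FakeClient()
print(call_with_refresh(client.read_pod, client.refresh))
```

With the real client, the exception seen in the traceback is `kubernetes.client.rest.ApiException`, so the equivalent check would be on that exception's status being 401.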

Meanwhile, I have been trying setting in_cluster to True in my task, which seems to have helped.

Warning: this loosens security on the Composer Kubernetes cluster. The role binding below gives the default service account in the default namespace cluster-admin rights inside the kube-public namespace, so pods running under that account can create and modify pods there.
You first need to run this command to give the default service account the required permissions:

# Make sure you have the correct Kubernetes credentials first, using the gcloud container clusters get-credentials command

kubectl create rolebinding default-admin \
  --clusterrole=cluster-admin \
  --serviceaccount=default:default \
  --namespace=kube-public

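Then, in your Airflow DAG, call the Kubernetes pod operator, passing in_cluster and namespace:

    from airflow.contrib.operators import kubernetes_pod_operator

    # in_cluster=True makes the operator load the in-cluster configuration
    POD_TASK = kubernetes_pod_operator.KubernetesPodOperator(
        task_id='pod_task',
        name='pod-task',
        in_cluster=True,
        namespace='kube-public',
        image='xxxx')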
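
A likely reason `in_cluster=True` helps is that the in-cluster configuration authenticates with the service account token mounted into the pod rather than with a user token taken from a kubeconfig file. The sketch below checks the two things that path relies on: the `KUBERNETES_SERVICE_HOST` and `KUBERNETES_SERVICE_PORT` environment variables and the mounted token file. The function name `in_cluster_credentials_available` is hypothetical, and this is only a rough check, not what the operator itself runs.

```python
import os

# Standard location where Kubernetes mounts the pod's service account token.
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def in_cluster_credentials_available(environ=None, token_path=TOKEN_PATH):
    """Return True if in-cluster API credentials appear to be present."""
    environ = os.environ if environ is None else environ
    has_endpoint = bool(
        environ.get("KUBERNETES_SERVICE_HOST")
        and environ.get("KUBERNETES_SERVICE_PORT")
    )
    return has_endpoint and os.path.isfile(token_path)


# Prints True inside a pod with a mounted token, False on a typical laptop.
print(in_cluster_credentials_available())
```

Running this from inside an Airflow worker pod is one way to confirm that the in-cluster path is usable before changing the DAG.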



Amir Amangeldi

Jan 7, 2019, 11:06:12 AM
to cloud-composer-discuss
Hi all,

Thank you Anthony for the workaround. Unfortunately, my team is working with sensitive data and we don't want to expose it in the 'kube-public' namespace.
I am wondering if there have been any updates from the Cloud Composer team regarding a proper fix for this issue?

Thank you,
Amir Amangeldi


EVERQUOTE  |  Software Engineer






Wilson Lian

Jan 7, 2019, 6:06:15 PM
to Amir Amangeldi, cloud-composer-discuss
Please try installing [1] the PyPI package kubernetes>=8.0.1 in your environment. It includes the fix for https://github.com/kubernetes-client/python-base/issues/59. If the issue persists after that, please post the full task instance log if you're able to.
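
To confirm that the environment actually picked up the newer client, the installed version can be compared against 8.0.1. The sketch below assumes Python 3.8+ for `importlib.metadata` and uses a deliberately simple numeric comparison that ignores pre-release suffixes; `version_tuple`, `meets_minimum`, and `installed_kubernetes_version` are illustrative names.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn '8.0.1' into (8, 0, 1), ignoring any non-numeric suffix."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum="8.0.1"):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_kubernetes_version():
    """Return the installed kubernetes client version, or None if absent."""
    try:
        return metadata.version("kubernetes")
    except metadata.PackageNotFoundError:
        return None


found = installed_kubernetes_version()
if found is None:
    print("kubernetes client is not installed")
else:
    print(found, "ok" if meets_minimum(found) else "too old")
```

Running this inside a task shows which client version the workers actually import.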



Amir Amangeldi

Jan 8, 2019, 2:24:24 PM
to cloud-composer-discuss
This solution worked perfectly, thank you Wilson!