Job running forever


Gabriel Richter

Apr 8, 2024, 11:34:20 AM
to rundeck-discuss
Hi!

I'm using Rundeck Community 4.17.1.

The database is an Azure PostgreSQL Flexible Server (PaaS) running version 15.4.

Sometimes a job just doesn't finish; it keeps running forever. When this happens, I can see an open connection in the database running the following queries:

select this_.id as id1_14_0_, this_.version as version2_14_0_, this_.log_output_threshold as log_outp3_14_0_, this_.do_nodedispatch as do_noded4_14_0_, this_.next_execution as next_exe5_14_0_, this_.date_created as date_cre6_14_0_, this_.node_keepgoing as node_kee7_14_0_, this_.node_exclude_os_arch as node_exc8_14_0_, this_.uuid as uuid9_14_0_, this_.node_include as node_in10_14_0_, this_.success_on_empty_node_filter as success11_14_0_, this_.node_exclude_os_version as node_ex12_14_0_, this_.timeout as timeout13_14_0_, this_.node_exclude_precedence as node_ex14_14_0_, this_.node_exclude_name as node_ex15_14_0_, this_.notify_avg_duration_threshold as notify_16_14_0_, this_.day_of_week as day_of_17_14_0_, this_.node_include_os_version as node_in18_14_0_, this_.node_exclude_os_name as node_ex19_14_0_, this_.retry as retry20_14_0_, this_.filter as filter21_14_0_, this_.group_path as group_p22_14_0_, this_.scheduled as schedul23_14_0_, this_.orchestrator_id as orchest24_14_0_, this_.node_threadcount_dynamic as node_th25_14_0_, this_.node_include_name as node_in26_14_0_, this_.multiple_executions as multipl27_14_0_, this_.time_zone as time_zo28_14_0_, this_.rduser as rduser29_14_0_, this_.node_include_os_name as node_in30_14_0_, this_.filter_exclude as filter_31_14_0_, this_.node_exclude as node_ex32_14_0_, this_.node_rank_order_ascending as node_ra33_14_0_, this_.nodes_selected_by_default as nodes_s34_14_0_, this_.node_include_os_arch as node_in35_14_0_, this_.loglevel as logleve36_14_0_, this_.node_exclude_os_family as node_ex37_14_0_, this_.execution_enabled as executi38_14_0_, this_.max_multiple_executions as max_mul39_14_0_, this_.node_include_os_family as node_in40_14_0_, this_.last_updated as last_up41_14_0_, this_.retry_delay as retry_d42_14_0_, this_.workflow_id as workflo43_14_0_, this_.exec_count as exec_co44_14_0_, this_.month as month45_14_0_, this_.hour as hour46_14_0_, this_.log_output_threshold_action as log_out47_14_0_, this_.arg_string as arg_str48_14_0_, this_.user_role_list as user_ro49_14_0_, this_.total_time as total_t50_14_0_, this_.node_rank_attribute as node_ra51_14_0_, this_.server_nodeuuid as server_52_14_0_, this_.default_tab as default53_14_0_, this_.node_exclude_tags as node_ex54_14_0_, this_.seconds as seconds55_14_0_, this_.exclude_filter_uncheck as exclude56_14_0_, this_.ref_exec_count as ref_exe57_14_0_, this_.node_threadcount as node_th58_14_0_, this_.node_include_tags as node_in59_14_0_, this_.job_name as job_nam60_14_0_, this_.schedule_enabled as schedul61_14_0_, this_.year as year62_14_0_, this_.day_of_month as day_of_63_14_0_, this_.node_filter_editable as node_fi64_14_0_, this_.log_output_threshold_status as log_out65_14_0_, this_.description as descrip66_14_0_, this_.minute as minute67_14_0_, this_.project as project68_14_0_, this_.plugin_config as plugin_69_14_0_ from scheduled_execution this_ where this_.uuid = ? limit ?

... and
insert into base_report ( version, tags, node, ctx_type, author, title, message, date_completed, date_started, status, action_type, mapref_uri, ctx_project, report_id, ctx_name, aborted_by_user, ctx_controller, job_uuid, succeeded_node_list, ctx_command, failed_node_list, adhoc_execution, adhoc_script, execution_id, jc_job_id, filter_applied, class, id ) values ( ? )

The behavior is as if the application forgot to run a COMMIT or ROLLBACK at the end of the transaction. If I try to update the execution row in the "execution" table, the UPDATE blocks on a lock held by the transaction that ran the INSERT, so I have to kill those queries and perform the UPDATE manually.
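To see which session holds the lock, a query like this can help (a sketch using pg_stat_activity and pg_blocking_pids(), available since PostgreSQL 9.6):

```sql
-- For each blocked backend, show who is blocking it and what both are running.
-- pg_blocking_pids() returns the PIDs whose locks the given backend is waiting on.
SELECT blocked.pid                 AS blocked_pid,
       blocked.query               AS blocked_query,
       blocking.pid                AS blocking_pid,
       blocking.state              AS blocking_state,  -- often "idle in transaction" here
       blocking.query              AS blocking_last_query,
       now() - blocking.xact_start AS blocking_xact_age
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;
```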

Can someone help me please?

Thank you!


rac...@rundeck.com

Apr 8, 2024, 12:11:20 PM
to rundeck-discuss
Hi Gabriel,

Does this happen only with this job, or is it a general random issue? (Does it happen with ad-hoc commands too?) Please share an example job definition as well as the contents of service.log at the time of the incident. Also, do you observe the same behavior with a standard PostgreSQL backend (in a test environment)?

Regards.

Gabriel Richter

Apr 9, 2024, 10:16:50 AM
to rundeck-discuss
Until yesterday it had happened twice. This morning it was a general problem that affected almost all jobs. I have jobs that run every hour, and they simply stopped running because of these database locks.

The locked query is:
update execution set version = ?, scheduled_execution_id = ?, do_nodedispatch = ?, node_exclude_os_arch = ?, node_keepgoing = ?, succeeded_node_list = ?, uuid = ?, retry_attempt = ?, node_include = ?, success_on_empty_node_filter = ?, retry_prev_id = ?, extra_metadata = ?, node_exclude_os_version = ?, timeout = ?, node_exclude_precedence = ?, node_exclude_name = ?, node_include_os_version = ?, node_exclude_os_name = ?, retry = ?, filter = ?, orchestrator_id = ?, node_include_name = ?, rduser = ?, retry_original_id = ?, execution_type = ?, node_include_os_name = ?, abortedby = ?, filter_exclude = ?, node_exclude = ?, node_rank_order_ascending = ?, node_include_os_arch = ?, loglevel = ?, node_exclude_os_family = ?, job_uuid = ?, node_include_os_family = ?, cancelled = ?, retry_delay = ?, workflow_id = ?, timed_out = ?, failed_node_list = ?, arg_string = ?, user_role_list = ?, node_rank_attribute = ?, date_completed = ?, outputfilepath = ?, server_nodeuuid = ?, will_retry = ?, retry_execution_id = ?, node_exclude_tags = ?, exclude_filter_uncheck = ?, node_threadcount = ?, node_include_tags = ?, date_started = ?, status = ?, node_filter_editable = ?, project = ? where id = ? and version = ?

... and the locking query is:
insert into base_report ( version, tags, node, ctx_type, author, title, message, date_completed, date_started, status, action_type, mapref_uri, ctx_project, report_id, ctx_name, aborted_by_user, ctx_controller, job_uuid, succeeded_node_list, ctx_command, failed_node_list, adhoc_execution, adhoc_script, execution_id, jc_job_id, filter_applied, class, id ) values ( ? )


In the service.log file there were many occurrences of this error:

Caused by: org.hibernate.StaleStateException: Batch update returned unexpected row count from update [0]; actual row count: 0; expected: 1; statement executed: update execution set version=?, scheduled_execution_id=?, do_nodedispatch=?, node_exclude_os_arch=?, node_keepgoing=?, succeeded_node_list=?, uuid=?, retry_attempt=?, node_include=?, success_on_empty_node_filter=?, retry_prev_id=?, extra_metadata=?, node_exclude_os_version=?, timeout=?, node_exclude_precedence=?, node_exclude_name=?, node_include_os_version=?, node_exclude_os_name=?, retry=?, filter=?, orchestrator_id=?, node_include_name=?, rduser=?, retry_original_id=?, execution_type=?, node_include_os_name=?, abortedby=?, filter_exclude=?, node_exclude=?, node_rank_order_ascending=?, node_include_os_arch=?, loglevel=?, node_exclude_os_family=?, job_uuid=?, node_include_os_family=?, cancelled=?, retry_delay=?, workflow_id=?, timed_out=?, failed_node_list=?, arg_string=?, user_role_list=?, node_rank_attribute=?, date_completed=?, outputfilepath=?, server_nodeuuid=?, will_retry=?, retry_execution_id=?, node_exclude_tags=?, exclude_filter_uncheck=?, node_threadcount=?, node_include_tags=?, date_started=?, status=?, node_filter_editable=?, project=? where id=? and version=?

It happened with all the jobs, so it's not specific to one.

Thank you!

Gabriel Richter

Apr 9, 2024, 10:38:08 AM
to rundeck-discuss
It's also important to mention that, looking at pg_stat_statements and DBM (Datadog Database Monitoring), the mean execution time for the INSERT is in microseconds, so it doesn't justify the UPDATE being blocked. It's as if the application crashed and forgot to commit or roll back the transaction: the transaction stays active in the database, and only after I kill it is the UPDATE released to run.

Is that possible? Maybe a bug that has already been fixed, or something like that?
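If that is what's happening, sessions stuck in that state can be spotted (and, if needed, terminated) with something like this sketch against pg_stat_activity:

```sql
-- Sessions that opened a transaction and then went idle without COMMIT/ROLLBACK.
SELECT pid, usename, state, xact_start,
       now() - xact_start AS xact_age,
       query              -- the last statement the session ran
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;

-- To release the locks, terminate the offending backend
-- (this rolls back its open transaction):
-- SELECT pg_terminate_backend(<pid>);
```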

rac...@rundeck.com

Apr 9, 2024, 3:03:53 PM
to rundeck-discuss
Hi Gabriel.

I've set up a similar environment (Rundeck 4.17.1 on Azure PostgreSQL Flexible Server) to test for similar behavior. My backend is configured without the Replication/HA options. Is that your case too? If your backend is configured with Replication/HA, please take that into account, as it could be the issue's root cause. I'm still researching.

Regards.

Gabriel Richter

Apr 9, 2024, 5:11:51 PM
to rundeck-discuss
I'm using a single VM and a single PostgreSQL server, so no replication/HA.

rac...@rundeck.com

Apr 10, 2024, 12:28:34 PM
to rundeck-discuss

Hi Gabriel,

This is my test DB environment:

{
    "sku": {
        "name": "Standard_B1ms",
        "tier": "Burstable"
    },
    "systemData": {
        "createdAt": "2024-04-09T17:22:09.1507298Z"
    },
    "properties": {
        "authConfig": {
            "activeDirectoryAuth": "Disabled",
            "passwordAuth": "Enabled"
        },
        "dataEncryption": {
            "type": "SystemManaged"
        },
        "fullyQualifiedDomainName": "dbname.postgres.database.azure.com",
        "version": "15",
        "minorVersion": "6",
        "administratorLogin": "user",
        "state": "Stopped",
        "availabilityZone": "1",
        "storage": {
            "storageSizeGB": 32
        },
        "backup": {
            "backupRetentionDays": 7,
            "geoRedundantBackup": "Disabled",
            "earliestRestoreDate": "2024-04-09T17:28:26.5884363+00:00"
        },
        "network": {
            "publicNetworkAccess": "Disabled"
        },
        "highAvailability": {
            "mode": "Disabled",
            "state": ""
        },
        "maintenanceWindow": {
            "customWindow": "Disabled",
            "dayOfWeek": 0,
            "startHour": 0,
            "startMinute": 0
        },
        "replicationRole": "Primary",
        "replicaCapacity": 5
    },
    "location": "East US",
    "tags": {
        "created_by": "us...@example.net"
    },
    "id": "/subscriptions/xxxx-xxxx-xxxx-xxxx-xxxx/resourceGroups/xxxx/providers/Microsoft.DBforPostgreSQL/flexibleServers/dbname",
    "name": "dbname",
    "type": "Microsoft.DBforPostgreSQL/flexibleServers"
}

Could you check and compare? Probably I’m missing something.

Regards.

Gabriel Richter

Apr 10, 2024, 1:28:25 PM
to rundeck-discuss
Hi

This is our server:
{
    "sku": {
        "name": "Standard_B1ms",
        "tier": "Burstable"
    },
    "systemData": {
        "createdAt": "2023-10-20T15:27:15.8037846Z"
    },
    "properties": {
        "authConfig": {
            "activeDirectoryAuth": "Enabled",
            "passwordAuth": "Enabled",
            "tenantId": "fake-tenant-id"
        },
        "dataEncryption": {
            "type": "SystemManaged"
        },
        "fullyQualifiedDomainName": "pg-rundeck-prd.postgres.database.azure.com",
        "version": "15",
        "minorVersion": "4",
        "administratorLogin": "username",
        "state": "Ready",
        "availabilityZone": "2",
        "storage": {
            "storageSizeGB": 32
        },
        "backup": {
            "backupRetentionDays": 30,
            "geoRedundantBackup": "Disabled",
            "earliestRestoreDate": "2024-03-12T16:17:01.9322606+00:00"
        },
        "network": {
            "publicNetworkAccess": "Disabled",
            "delegatedSubnetResourceId": "fake-subnet-id",
            "privateDnsZoneArmResourceId": "fake-private-dns-zone-id"
        },
        "highAvailability": {
            "mode": "Disabled",
            "state": "NotEnabled"
        },
        "maintenanceWindow": {
            "customWindow": "Enabled",
            "dayOfWeek": 0,
            "startHour": 0,
            "startMinute": 0
        },
        "replicationRole": "Primary",
        "replicaCapacity": 5
    },
    "location": "Brazil South",
    "tags": {},
    "id": "fake-resource-id",
    "name": "pg-rundeck-prd",
    "type": "Microsoft.DBforPostgreSQL/flexibleServers"
}

Something we tried yesterday was to change the idle_in_transaction_session_timeout default value to 60 seconds (under "Server parameters").

The idea is that if an "idle in transaction" session appears, PostgreSQL will kill it after 60 seconds. So even if an error occurs in Rundeck and a transaction stays open, it won't keep blocking the other jobs from running.
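On Azure Flexible Server this is done through the "Server parameters" blade; on a self-managed PostgreSQL the equivalent would be something like:

```sql
-- Kill any session that sits "idle in transaction" for more than 60 seconds.
ALTER SYSTEM SET idle_in_transaction_session_timeout = '60s';
SELECT pg_reload_conf();

-- Verify the effective value:
SHOW idle_in_transaction_session_timeout;
```

Note that the timed-out transaction is rolled back, so this is a safety net, not a fix for whatever leaves the transaction open in the first place.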

Thank you!

Gabriel Richter

Apr 11, 2024, 8:34:27 AM
to rundeck-discuss
Today I had the same problem, but this time the lock was not present (maybe because of the idle_in_transaction_session_timeout).

This time a single job was stuck. It runs every hour, and its last run was yesterday at 5 pm.

To "solve" this I just ran this on the database:
update execution set status ='failed', date_completed = now() where id = 327396
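To find which executions are in that state before patching them, a sketch like this can help (column names taken from the queries earlier in the thread; the assumption that date_completed IS NULL means "still running" is mine):

```sql
-- Executions Rundeck still considers running, i.e. the ones counted against
-- the "Limit of running executions". Review the list before updating anything.
SELECT id, job_uuid, status, date_started
FROM execution
WHERE date_completed IS NULL
ORDER BY date_started;
```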

It's important to note that, looking at the logs, the job should have finished. It recorded the end log normally:
[2024-04-10T20:05:04,255] INFO  execution.status azure_devops finish [327396:failed] index-maintenance-mssql azure_devops/- "abi-br-prod/athena-prod-meslotes-reorganize  -DOCKER_IMAGE_NAME docker.ambevdevs.com.br/ambevtech/automation/database/index/maintenance -DOCKER_IMAGE_TAG 2263044 -IMAGE_PULL_POLICY Always -COMMAND_TO_RUN "python app.py" -KUBERNETES_CLUSTER_NAME_TO_BE_USED cluster-name  -AZURE_SUBSCRIPTION_ID my-azure-subscription-id -AZURE_RESOURCE_GROUP my-rg -SQL_SERVER_NAME server-name.database.windows.net -SQL_DATABASE_NAMES DBName -IS_TO_REORGANIZE_INDEXES True -IS_TO_REBUILD_INDEXES False -IS_TO_REBUILD_HEAP_TABLES False -MINIMUM_FRAGMENTATION_TO_REORGANIZE 5 -MINIMUM_FRAGMENTATION_TO_REBUILD 30 -IS_TO_UPDATE_STATISTICS False -MINIMUM_MODIFICATION_COUNTER 1000 -OUTDATED_IN_HOURS 12 -IS_TO_SIMULATE_EXECUTION False -LOG_LEVEL INFO"[index-maintenance-mssql-abi-br-prod-athena-prod-meslotes-reorganize]

After that, all subsequent executions of the job were skipped because of this error:
[2024-04-10T22:00:00,067] ERROR quartzjobs.ExecutionJob [quartzScheduler_Worker-2] - Unable to start Job execution: Job "athena-prod-meslotes-reorganize" {{Job index-maintenance-mssql-abi-br-prod-athena-prod-meslotes-reorganize}}: Limit of running executions has been reached.

In the VM, I found this execution log:
^text/x-rundeck-log-v2.0^
^2024-04-10T20:00:00Z|stepbegin||{node=localhost|step=1|stepctx=1|user=azure_devops}|^
^2024-04-10T20:00:00Z|nodebegin||{node=localhost|step=1|stepctx=1}|^
^2024-04-10T20:00:01Z|stepbegin||{node=localhost|step=1|stepctx=1@node\=localhost/1|user=azure_devops}|^
^2024-04-10T20:00:01Z|nodebegin||{node=localhost|step=1|stepctx=1@node\=localhost/1}|^
^2024-04-10T20:00:01Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(!) Using kubeconfig file: /var/lib/rundeck/.kube/clustername.yaml^
^2024-04-10T20:00:01Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(!) Using metadata file: /var/lib/rundeck/.kube/ clustername .metadata^
^2024-04-10T20:00:01Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(!) K8S cluster metadata:^
^2024-04-10T20:00:01Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Azure Subscription name: subscription-name^
^2024-04-10T20:00:01Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Environment: prd^
^2024-04-10T20:00:01Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|It is Brewzone? true^
^2024-04-10T20:00:01Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Generating temporary kubeconfig file: /tmp/kubeconfig-ypfstdjfoqebk.yaml^
^2024-04-10T20:00:01Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(!) Set KUBECONFIG has environment variable^
^2024-04-10T20:00:01Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Logging in using kubelogin^
^2024-04-10T20:00:01Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Creating the deployment file for namespace^
^2024-04-10T20:00:01Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|apiVersion: v1^
^2024-04-10T20:00:01Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|kind: Namespace^
^2024-04-10T20:00:01Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|metadata:^
^2024-04-10T20:00:01Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|  name: rundeck-agents^
^2024-04-10T20:00:01Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Deploying namespace rundeck-agents^
^2024-04-10T20:00:04Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|namespace/rundeck-agents unchanged^
^2024-04-10T20:00:04Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Checking if the secret nexus exist^
^2024-04-10T20:00:06Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(!) The secret nexus already exist. Do nothing!^
^2024-04-10T20:00:06Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Creating the base deployment file for CONFIGMAP^
^2024-04-10T20:00:06Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Adding environment variable AUTO_GENERATED_IS_BREWZONE on the configmap^
^2024-04-10T20:00:06Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Adding environment variable AUTO_GENERATED_SUBSCRIPTION_NAME on the configmap^
^2024-04-10T20:00:06Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Adding environment variable AUTO_GENERATED_ENVIROMENT on the configmap^
^2024-04-10T20:00:06Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Adding environment variable AZURE_SUBSCRIPTION_ID on the configmap^
^2024-04-10T20:00:06Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Adding environment variable AZURE_RESOURCE_GROUP on the configmap^
^2024-04-10T20:00:06Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Adding environment variable SQL_SERVER_NAME on the configmap^
^2024-04-10T20:00:06Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Adding environment variable SQL_DATABASE_NAMES on the configmap^
^2024-04-10T20:00:06Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Adding environment variable IS_TO_REORGANIZE_INDEXES on the configmap^
^2024-04-10T20:00:06Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Adding environment variable IS_TO_REBUILD_INDEXES on the configmap^
^2024-04-10T20:00:07Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Adding environment variable IS_TO_REBUILD_HEAP_TABLES on the configmap^
^2024-04-10T20:00:07Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Adding environment variable MINIMUM_FRAGMENTATION_TO_REORGANIZE on the configmap^
^2024-04-10T20:00:07Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Adding environment variable MINIMUM_FRAGMENTATION_TO_REBUILD on the configmap^
^2024-04-10T20:00:07Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Adding environment variable IS_TO_UPDATE_STATISTICS on the configmap^
^2024-04-10T20:00:07Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Adding environment variable MINIMUM_MODIFICATION_COUNTER on the configmap^
^2024-04-10T20:00:07Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Adding environment variable OUTDATED_IN_HOURS on the configmap^
^2024-04-10T20:00:07Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Adding environment variable IS_TO_SIMULATE_EXECUTION on the configmap^
^2024-04-10T20:00:07Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Adding environment variable LOG_LEVEL on the configmap^
^2024-04-10T20:00:07Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Adding environment variable AZURE_TENANT_ID on the configmap^
^2024-04-10T20:00:07Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Adding environment variable AZURE_CLIENT_ID on the configmap^
^2024-04-10T20:00:07Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Adding environment variable AZURE_CLIENT_SECRET on the configmap^
^2024-04-10T20:00:07Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Deploying configmap rundeck-configs-ypfstdjfoqebk^
^2024-04-10T20:00:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|configmap/rundeck-configs-ypfstdjfoqebk created^
^2024-04-10T20:00:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Creating the deployment file for POD^
^2024-04-10T20:00:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|apiVersion: v1^
^2024-04-10T20:00:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|kind: Pod^
^2024-04-10T20:00:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|metadata:^
^2024-04-10T20:00:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|  name: rundeck-ypfstdjfoqebk^
^2024-04-10T20:00:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|spec:^
^2024-04-10T20:00:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|  containers:^
^2024-04-10T20:00:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|    - name: main^
^2024-04-10T20:00:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|      image: my.dockerimage/blabla:latest^
^2024-04-10T20:00:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|      imagePullPolicy: Always^
^2024-04-10T20:00:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|      command: ["/bin/sleep", "7d"]  ^
^2024-04-10T20:00:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|      envFrom:^
^2024-04-10T20:00:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|      - configMapRef:^
^2024-04-10T20:00:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|name: rundeck-configs-ypfstdjfoqebk^
^2024-04-10T20:00:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|      resources:^
^2024-04-10T20:00:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|requests:^
^2024-04-10T20:00:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|memory: "256Mi"^
^2024-04-10T20:00:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|cpu: "150m"^
^2024-04-10T20:00:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|limits:^
^2024-04-10T20:00:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|memory: "512Mi"^
^2024-04-10T20:00:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|cpu: "1000m"^
^2024-04-10T20:00:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|  imagePullSecrets:^
^2024-04-10T20:00:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|    - name: nexus^
^2024-04-10T20:00:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Deploying POD name rundeck-ypfstdjfoqebk^
^2024-04-10T20:00:10Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|pod/rundeck-ypfstdjfoqebk created^
^2024-04-10T20:00:10Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|(+) Waiting until pod to be in RUNNING state^
^2024-04-10T20:00:11Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Pending^
^2024-04-10T20:00:11Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 1^
^2024-04-10T20:00:17Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Pending^
^2024-04-10T20:00:17Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 2^
^2024-04-10T20:00:22Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Pending^
^2024-04-10T20:00:22Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 3^
^2024-04-10T20:00:28Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Pending^
^2024-04-10T20:00:28Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 4^
^2024-04-10T20:00:34Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Pending^
^2024-04-10T20:00:34Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 5^
^2024-04-10T20:00:40Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Pending^
^2024-04-10T20:00:40Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 6^
^2024-04-10T20:00:45Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Pending^
^2024-04-10T20:00:45Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 7^
^2024-04-10T20:00:51Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Pending^
^2024-04-10T20:00:51Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 8^
^2024-04-10T20:00:57Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Pending^
^2024-04-10T20:00:57Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 9^
^2024-04-10T20:01:03Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Pending^
^2024-04-10T20:01:03Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 10^
^2024-04-10T20:01:08Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Pending^
^2024-04-10T20:01:08Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 11^
^2024-04-10T20:01:14Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Pending^
^2024-04-10T20:01:14Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 12^
^2024-04-10T20:01:20Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Pending^
^2024-04-10T20:01:20Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 13^
^2024-04-10T20:01:26Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Pending^
^2024-04-10T20:01:26Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 14^
^2024-04-10T20:01:32Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Pending^
^2024-04-10T20:01:32Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 15^
^2024-04-10T20:01:38Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Pending^
^2024-04-10T20:01:38Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 16^
^2024-04-10T20:01:44Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Pending^
^2024-04-10T20:01:44Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 17^
^2024-04-10T20:01:49Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Pending^
^2024-04-10T20:01:49Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 18^
^2024-04-10T20:01:55Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Pending^
^2024-04-10T20:01:55Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 19^
^2024-04-10T20:02:01Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Pending^
^2024-04-10T20:02:01Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 20^
^2024-04-10T20:02:07Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Pending^
^2024-04-10T20:02:07Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 21^
^2024-04-10T20:02:13Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Pending^
^2024-04-10T20:02:13Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 22^
^2024-04-10T20:02:19Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:02:19Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 23^
^2024-04-10T20:02:25Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:02:25Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 24^
^2024-04-10T20:02:30Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:02:30Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 25^
^2024-04-10T20:02:36Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:02:36Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 26^
^2024-04-10T20:02:42Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:02:42Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 27^
^2024-04-10T20:02:48Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:02:48Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 28^
^2024-04-10T20:02:53Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:02:53Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 29^
^2024-04-10T20:02:59Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:02:59Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 30^
^2024-04-10T20:03:05Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:03:05Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 31^
^2024-04-10T20:03:11Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:03:11Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 32^
^2024-04-10T20:03:17Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:03:17Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 33^
^2024-04-10T20:03:22Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:03:22Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 34^
^2024-04-10T20:03:28Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:03:28Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 35^
^2024-04-10T20:03:34Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:03:34Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 36^
^2024-04-10T20:03:40Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:03:40Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 37^
^2024-04-10T20:03:46Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:03:46Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 38^
^2024-04-10T20:03:51Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:03:51Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 39^
^2024-04-10T20:03:57Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:03:57Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 40^
^2024-04-10T20:04:03Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:04:03Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 41^
^2024-04-10T20:04:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:04:09Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 42^
^2024-04-10T20:04:15Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:04:15Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 43^
^2024-04-10T20:04:21Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:04:21Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 44^
^2024-04-10T20:04:26Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:04:26Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 45^
^2024-04-10T20:04:32Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:04:32Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 46^
^2024-04-10T20:04:38Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:04:38Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 47^
^2024-04-10T20:04:44Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:04:44Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 48^
^2024-04-10T20:04:49Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:04:49Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 49^
^2024-04-10T20:04:56Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Pod status is: Failed^
^2024-04-10T20:04:56Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Waiting... Retry 50^
^2024-04-10T20:05:01Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Error: reached the maximum number of retries waiting for the pod to be healthy. Total retries: 50^
^2024-04-10T20:05:01Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|NAME                    READY   STATUS                   RESTARTS   AGE^
^2024-04-10T20:05:01Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|rundeck-ypfstdjfoqebk   0/1     ContainerStatusUnknown   1          4m51s^
^2024-04-10T20:05:01Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Cleaning up...^
^2024-04-10T20:05:01Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.^
^2024-04-10T20:05:02Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|pod "rundeck-ypfstdjfoqebk" force deleted^
^2024-04-10T20:05:03Z|||{node=localhost|step=1|stepctx=1@node\=localhost/1}|configmap "rundeck-configs-ypfstdjfoqebk" deleted^
^2024-04-10T20:05:03Z||ERROR|{node=localhost|step=1|stepctx=1@node\=localhost/1}|Result: 1^
^2024-04-10T20:05:03Z||ERROR|{node=localhost|step=1|stepctx=1@node\=localhost/1}|Failed: NonZeroResultCode: Result code was 1^
^2024-04-10T20:05:03Z|nodeend||{executionState=FAILED|failureReason=NonZeroResultCode|node=localhost|resultCode=1|step=1|stepctx=1@node\=localhost/1}|Result code was 1^
^2024-04-10T20:05:03Z|stepend||{executionState=FAILED|failedNodes=localhost|failureReason=NodeDispatchFailure|node=localhost|step=1|stepctx=1@node\=localhost/1|user=azure_devops}|Node dispatch failed^
^2024-04-10T20:05:03Z||ERROR|{node=localhost|step=1|stepctx=1}|Failed: JobFailed: Job [models-python-run-on-k8s] failed^
^2024-04-10T20:05:03Z|nodeend||{executionState=FAILED|failureReason=JobFailed|node=localhost|resultCode=-1|step=1|stepctx=1}|Job [models-python-run-on-k8s] failed^
^2024-04-10T20:05:03Z|stepend||{executionState=FAILED|failedNodes=localhost|failureReason=NodeDispatchFailure|node=localhost|step=1|stepctx=1|user=azure_devops}|Node dispatch failed^
^2024-04-10T20:05:04Z||ERROR|{node=localhost|user=azure_devops}|Execution failed: 327396 in project index-maintenance-mssql: [Workflow result: , step failures: {1=Dispatch failed on 1 nodes: [localhost: JobFailed: Job [models-python-run-on-k8s] failed + {dataContext=MultiDataContextImpl(map={}, base=null)} ]}, Node failures: {localhost=[JobFailed: Job [models-python-run-on-k8s] failed + {dataContext=MultiDataContextImpl(map={}, base=null)} ]}, status: failed]^


My job definition is below. Basically, it creates a pod in Kubernetes and does some work there. In this specific failure, the pod never became healthy, so the script exits 1.
#!/bin/bash

... a lot of code here before

# Generate deployment for POD
echo "(+) Creating the deployment file for POD"
CONTAINER_NAME="main"
cat <<EOT > "$DEPLOYMENT_PATH"
apiVersion: v1
kind: Pod
metadata:
  name: $POD_NAME
spec:
  containers:
    - name: $CONTAINER_NAME
      image: $DOCKER_IMAGE_NAME:$DOCKER_IMAGE_TAG
      imagePullPolicy: $IMAGE_PULL_POLICY
      command: ["/bin/sleep", "7d"]
      envFrom:
        - configMapRef:
            name: $CONFIGMAP_NAME
      resources:
        requests:
          memory: "256Mi"
          cpu: "150m"
        limits:
          memory: "512Mi"
          cpu: "1000m"
  imagePullSecrets:
    - name: $SECRET_NAME
EOT
cat "$DEPLOYMENT_PATH"
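One optional hardening step (an editor's sketch, not part of the original job): validate the generated manifest client-side before deploying, so a templating or indentation mistake fails fast. `kubectl apply --dry-run=client` parses the file without creating anything on the cluster. The `validate_manifest` helper is hypothetical; `DEPLOYMENT_PATH` and `NAMESPACE` are the variables the job already uses.

```shell
# Hypothetical helper: validate a manifest client-side before applying it.
# Nothing is created on the cluster with --dry-run=client.
validate_manifest() {
  kubectl apply --dry-run=client -f "$1" -n "$NAMESPACE"
}

# Example: abort early if the generated YAML does not parse.
# validate_manifest "$DEPLOYMENT_PATH" || exit 1
```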

# Deploy the pod
echo "(+) Deploying POD name $POD_NAME"
kubectl apply -f "$DEPLOYMENT_PATH" -n "$NAMESPACE"
if [ $? -ne 0 ]; then  # Rollback
  echo "Error applying deployment (POD)."
  echo "Cleaning up..."
  kubectl delete configmap "$CONFIGMAP_NAME" -n "$NAMESPACE" --ignore-not-found
  if [ $? -ne 0 ]; then echo "Error removing configMap"; fi
  exit 1
fi
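The same cleanup block (delete pod, delete configmap) is repeated on every error path in this script. A minimal sketch of consolidating it with an EXIT trap, reusing the job's own variable names (`POD_NAME`, `CONFIGMAP_NAME`, `NAMESPACE`); the `cleanup` function name is an assumption, not part of the original job:

```shell
# Delete the pod and configmap; tolerant of resources that no longer exist.
cleanup() {
  echo "Cleaning up..."
  kubectl delete pod "$POD_NAME" -n "$NAMESPACE" \
    --grace-period=0 --force --ignore-not-found || echo "Error deleting POD"
  kubectl delete configmap "$CONFIGMAP_NAME" -n "$NAMESPACE" \
    --ignore-not-found || echo "Error deleting CONFIGMAP"
}
trap cleanup EXIT   # runs on every exit path, so a bare 'exit 1' is enough
```

With the trap installed, each failure branch shrinks to an error message plus `exit 1`.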

# Wait for the pod to be running
echo "(+) Waiting until the pod is in RUNNING state"
retry=1
while true
do
  if [[ $retry -gt $TOTAL_RETRIES ]]; then
    echo "Error: reached the maximum number of retries waiting for the pod to be healthy. Total retries: $TOTAL_RETRIES"
    # --ignore-not-found avoids an error if the pod no longer exists
    kubectl get pod "$POD_NAME" -n "$NAMESPACE" --ignore-not-found
    echo "Cleaning up..."
    kubectl delete pod "$POD_NAME" -n "$NAMESPACE" --grace-period=0 --force
    if [ $? -ne 0 ]; then echo "Error deleting POD"; fi
    kubectl delete configmap "$CONFIGMAP_NAME" -n "$NAMESPACE" --ignore-not-found
    if [ $? -ne 0 ]; then echo "Error deleting CONFIGMAP"; fi
    exit 1
  fi

  # Get the pod status; --ignore-not-found avoids an error if the pod doesn't exist
  status=$(kubectl get pod "$POD_NAME" -n "$NAMESPACE" --no-headers -o custom-columns=":status.phase" --ignore-not-found)
  if [ $? -ne 0 ]; then  # Rollback
    echo "Error getting POD status"
    echo "Cleaning up..."
    kubectl delete pod "$POD_NAME" -n "$NAMESPACE" --grace-period=0 --force
    if [ $? -ne 0 ]; then echo "Error deleting POD"; fi
    kubectl delete configmap "$CONFIGMAP_NAME" -n "$NAMESPACE" --ignore-not-found
    if [ $? -ne 0 ]; then echo "Error deleting CONFIGMAP"; fi
    exit 1
  fi

  if [ -n "$status" ]; then
    echo "Pod status is: $status"
  else
    echo "Warning! The status is empty. The pod probably doesn't exist!"
  fi

  if [[ $status == 'Running' ]]; then
    break
  fi

  echo "Waiting... Retry $retry"
  sleep "$SLEEP_IN_SECONDS"
  retry=$(( retry + 1 ))
done
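The polling loop can also be factored into a reusable helper; a minimal sketch under the same assumptions (`retry_until` and the `pod_running` predicate are hypothetical names, but the predicate is built from the same `kubectl get pod` call the job already uses):

```shell
# Run a command until it succeeds, up to MAX_RETRIES attempts,
# sleeping SLEEP_SECONDS between attempts; returns 1 on exhaustion.
retry_until() {
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "${MAX_RETRIES:-50}" ]; then
      echo "Error: reached $attempt retries waiting for: $*" >&2
      return 1
    fi
    echo "Waiting... Retry $attempt"
    sleep "${SLEEP_SECONDS:-5}"
    attempt=$((attempt + 1))
  done
}

# Hypothetical predicate: true once the pod's phase is Running.
pod_running() {
  [ "$(kubectl get pod "$POD_NAME" -n "$NAMESPACE" --no-headers \
       -o custom-columns=":status.phase" --ignore-not-found)" = "Running" ]
}
# retry_until pod_running || exit 1
```

If the pod's Ready condition is really what matters, `kubectl wait --for=condition=Ready pod/$POD_NAME -n $NAMESPACE --timeout=300s` can replace the hand-rolled polling entirely.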


... more code here