I'm using Azure Databricks with a custom configuration that uses VNet injection, and I am unable to start a cluster in my workspace. The error message is not documented anywhere in the Microsoft or Databricks documentation, so I am unable to diagnose why my cluster is not starting. I have reproduced the error message below:
I am also running into this error. However, I cannot try restarting the machine, because the error is passed back to the Terraform agent, and when I look into the Databricks workspace there is no compute cluster there after the terraform apply failure.
During one of these failures I noticed that the IP address of the Artifact Blob storage primary endpoint was different from what I had configured in the UDR for my region. At that exact moment I realized that the IP addresses of those services on that website are dynamic. The problem is that you cannot put hostnames in a UDR in Azure :( So this is exactly what I did:
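The post's actual steps are cut off above, so here is only a hedged sketch of one common workaround: re-resolve the service hostname on a schedule and rewrite the UDR route with the resulting /32 prefix. The function name and the route shape below are illustrative, not taken from the original post.

```python
import socket

def build_route_for_host(hostname, resolver=socket.gethostbyname):
    """Resolve a service hostname and build the /32 route entry for a UDR.

    Azure UDRs accept only CIDR prefixes, not hostnames, so the address has
    to be re-resolved periodically and the route updated when it changes.
    """
    ip = resolver(hostname)
    return {
        "address_prefix": f"{ip}/32",
        "next_hop_type": "Internet",  # send artifact-storage traffic directly out
    }

# With the azure-mgmt-network SDK, the route could then be applied roughly as:
#   client.routes.begin_create_or_update(rg, route_table, route_name, params)
```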
When you submit a step job in your EMR cluster, you can specify the step failure behavior in the ActionOnFailure parameter. The EMR cluster terminates if you select TERMINATE_CLUSTER or TERMINATE_JOB_FLOW for the ActionOnFailure parameter. For more information, see StepConfig.
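As a sketch of how ActionOnFailure is wired into a step submission, the helper below builds a StepConfig dictionary in the shape boto3 expects; the job name, jar, and arguments are placeholder values.

```python
def build_step(name, jar, args, action_on_failure="CONTINUE"):
    """Build an EMR StepConfig dictionary.

    ActionOnFailure controls what happens to the cluster when this step
    fails: CONTINUE keeps the cluster alive, while TERMINATE_CLUSTER (or
    the older TERMINATE_JOB_FLOW) shuts the whole cluster down.
    """
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {"Jar": jar, "Args": list(args)},
    }

# Submitting with boto3 would then look roughly like:
#   boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXX", Steps=[step])
```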
To avoid the preceding error, launch an EMR cluster with a larger instance type so that more memory is available for your cluster's requirements. Also, clean up disk space to avoid memory issues in long-running clusters. For more information, see How do I troubleshoot primary node failure with error "502 Bad Gateway" or "504 Gateway Time-out" in Amazon EMR?
When an EMR cluster terminates with an error, the DescribeCluster and ListClusters APIs return an error code and an error message. For some cluster errors, the ErrorDetail data array can also help you troubleshoot the failure. For more information, see Error codes with ErrorDetail information.
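A small sketch of pulling that information out of a DescribeCluster response: the response layout here follows the boto3 EMR documentation, and the sample values in the test are made up; ErrorDetails is only populated for some failure modes.

```python
def summarize_cluster_error(describe_cluster_response):
    """Extract the error code/message, and ErrorDetails when present,
    from an EMR DescribeCluster response dictionary."""
    status = describe_cluster_response["Cluster"]["Status"]
    reason = status.get("StateChangeReason", {})
    summary = {
        "state": status.get("State"),
        "code": reason.get("Code"),
        "message": reason.get("Message"),
    }
    details = status.get("ErrorDetails", [])
    if details:
        # ErrorDetails carries structured codes for some cluster failures
        summary["error_details"] = [d.get("ErrorCode") for d in details]
    return summary
```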
In Kubernetes, a container or pod may be restarted for a number of reasons, including to recover from runtime failures, to update the application or configuration, or due to resource constraints. If a container or pod experiences an OOMKilled error, it may be restarted automatically by Kubernetes, depending on the configuration of your cluster. Kubernetes provides multiple options for you to control how often and under what conditions a container or pod should be restarted.
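As a minimal illustration of those knobs, the helper below builds a pod manifest (as a plain dictionary, equivalent to the usual YAML) with the two fields most relevant here: the pod-level restartPolicy and a container memory limit, which is what the OOM killer enforces when a container is OOMKilled. Names and defaults are placeholders.

```python
def pod_with_restart_policy(name, image, memory_limit="256Mi",
                            restart_policy="OnFailure"):
    """Minimal pod manifest: restartPolicy may be Always, OnFailure, or
    Never; exceeding the memory limit is what triggers OOMKilled."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": restart_policy,
            "containers": [{
                "name": name,
                "image": image,
                "resources": {"limits": {"memory": memory_limit}},
            }],
        },
    }
```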
The "USE WITH CAUTION" warnings are there for a reason. Just as an example: because early versions of this article did not document these risks properly, I saw somebody force the removal of a project without all of its services having been removed first. That left services in the API for a non-existent project, which in turn caused the OpenShift SDN to misbehave badly, with significant cluster-wide impact.
To confirm if you have OSDs in your cluster, connect to the Rook Toolbox and run the ceph status command. You should see that you have at least one OSD up and in. The minimum number of OSDs required depends on the replicated.size setting in the pool created for the storage class. In a "test" cluster, only one OSD is required (see storageclass-test.yaml). In the production storage class example (storageclass.yaml), three OSDs would be required.
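The check above can be scripted against the JSON form of the same command (ceph status --format json). The function below is a sketch under the osdmap layout used by recent Ceph releases (older releases nest it one level deeper); required should match the pool's replicated.size, i.e. 1 for the test storage class and 3 for the production example.

```python
def osds_sufficient(ceph_status, required):
    """Return True if `ceph status --format json` output reports at least
    `required` OSDs that are both up and in."""
    osdmap = ceph_status["osdmap"]
    return min(osdmap["num_up_osds"], osdmap["num_in_osds"]) >= required
```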
One common case for failure is that you have re-deployed a test cluster and some state may remain from a previous deployment. If your cluster is larger than a few nodes, you may get lucky enough that the monitors were able to start and form quorum. However, now the OSDs pods may fail to start due to the old state. Looking at the OSD pod logs you will see an error about the file already existing.
In some circumstances, the Karpenter controller can fail to start up a node. For example, providing the wrong block storage device name in a custom launch template can result in a failure to start the node and an error similar to:
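For illustration, here is the shape of the launch-template fragment involved, as a Python dictionary matching the EC2 BlockDeviceMappings structure. The key detail is that DeviceName must match the root device the AMI actually expects (commonly /dev/xvda on Amazon Linux 2); a mismatch there is one way a custom launch template can leave the node unable to start. The size and volume type are placeholder values.

```python
def root_volume_mapping(device_name="/dev/xvda", size_gib=20):
    """One BlockDeviceMappings entry for an EC2 launch template.

    DeviceName must match the AMI's root device name, or the instance's
    root volume will not be attached where the AMI expects it.
    """
    return {
        "DeviceName": device_name,
        "Ebs": {
            "VolumeSize": size_gib,
            "VolumeType": "gp3",
            "DeleteOnTermination": True,
        },
    }
```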
The handler then makes the StopTask call with the information stored in the backend database, such as the ECS cluster ARN, the task ID, and the reason it received from the termination event. The AWS Batch service role associated with the compute environment makes the API call, and the job moves to the FAILED state. This API call, made by the user aws-batch, is logged in CloudTrail:
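The parameters of that call can be sketched as follows; the ARN, task ID, and reason below are placeholder values, and the shape mirrors the ECS StopTask API (cluster, task, reason).

```python
def build_stop_task_call(cluster_arn, task_id, reason):
    """Assemble the StopTask parameters described above: cluster ARN and
    task ID from the Batch backend state, reason from the termination
    event (and later visible on the job as the status reason)."""
    return {"cluster": cluster_arn, "task": task_id, "reason": reason}

# The service role would then invoke roughly:
#   boto3.client("ecs").stop_task(**params)
```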
What does "Warning: Note very large processing time" in the SlurmctldLogFile indicate?
This error is indicative of some operation taking an unexpectedly long time to complete, over one second to be specific. Setting the value of the SlurmctldDebug configuration parameter to debug2 or higher should identify which operation(s) are experiencing long delays. This message typically indicates long delays in file system access (writing state information or getting user information). Another possibility is that the node on which the slurmctld daemon executes has exhausted memory and is paging. Try running the program top to check for this possibility.

Is resource limit propagation useful on a homogeneous cluster?
Resource limit propagation permits a user to modify resource limits and submit a job with those limits. By default, Slurm automatically propagates all resource limits in effect at the time of job submission to the tasks spawned as part of that job. System administrators can utilize the PropagateResourceLimits and PropagateResourceLimitsExcept configuration parameters to change this behavior. Users can override defaults using the srun --propagate option. See "man slurm.conf" and "man srun" for more information about these options.

Do I need to maintain synchronized clocks on the cluster?
In general, yes. Having inconsistent clocks may cause nodes to be unusable. Slurm log files should contain references to expired credentials. For example:

error: Munge decode failed: Expired credential
ENCODED: Wed May 12 12:34:56 2008
DECODED: Wed May 12 12:01:12 2008

Why are "Invalid job credential" errors generated?
This error is indicative of Slurm's job credential files being inconsistent across the cluster. All nodes in the cluster must have the matching public and private keys as defined by JobCredPrivateKey and JobCredPublicKey in the Slurm configuration file slurm.conf.

Why are "Task launch failed on node ... Job credential replayed" errors generated?
This error indicates that a job credential generated by the slurmctld daemon corresponds to a job that the slurmd daemon has already revoked. The slurmctld daemon selects job ID values based upon the configured value of FirstJobId (the default value is 1), and each job gets a value one larger than the previous job. On job termination, the slurmctld daemon notifies the slurmd on each allocated node that all processes associated with that job should be terminated. The slurmd daemon maintains a list of the jobs which have already been terminated to avoid replay of task launch requests. If the slurmctld daemon is cold-started (with the "-c" option or "/etc/init.d/slurm startclean"), it starts job ID values over based upon FirstJobId. If the slurmd is not also cold-started, it will reject job launch requests for jobs that it considers terminated. The solution to this problem is to cold-start all slurmd daemons whenever the slurmctld daemon is cold-started.

Can Slurm be used with Globus?
Yes. Build and install Slurm's Torque/PBS command wrappers along with the Perl APIs from Slurm's contribs directory and configure Globus to use those PBS commands. Note there are RPMs available for both of these packages, named torque and perlapi respectively.

What causes the error "Unable to accept new connection: Too many open files"?
The srun command automatically increases its open file limit to the hard limit in order to process all of the standard input and output connections to the launched tasks. It is recommended that you set the open file hard limit to 8192 across the cluster.

Why does the setting of SlurmdDebug fail to log job step information at the appropriate level?
There are two programs involved here. One is slurmd, which is a persistent daemon running at the desired debug level. The second program is slurmstepd, which executes the user job and whose debug level is controlled by the user. Submitting the job with an option of --debug=# will result in the desired level of detail being logged in the SlurmdLogFile, plus the output of the program.

Why aren't pam_slurm.so, auth_none.so, or other components in a Slurm RPM?
It is possible that at build time the required dependencies for building the library are missing. If you want to build the library, install pam-devel and compile again. See the file slurm.spec in the Slurm distribution for a list of other options that you can specify at compile time with rpmbuild flags and your rpmmacros file. The auth_none plugin is in a separate RPM and not built by default. Using the auth_none plugin means that Slurm communications are not authenticated, so you probably do not want to run in this mode of operation except for testing purposes. If you want to build the auth_none RPM, add --with auth_none on the rpmbuild command line or add %_with_auth_none to your rpmmacros file. See the file slurm.spec in the Slurm distribution for a list of other options.

Why should I use the slurmdbd instead of the regular database plugins?
While the normal storage plugins will work fine without the added layer of the slurmdbd, there are some great benefits to using the slurmdbd.