The continued growth and support of NIH's Biowulf cluster is dependent upon its demonstrable value to the NIH Intramural Research Program. If you publish research that involved significant use of Biowulf, please cite the cluster. Suggested citation text: This work utilized the computational resources of the NIH HPC Biowulf cluster ( ).
Use 'ssh biowulf.nih.gov' to connect from the command line. See Connecting to the NIH HPC systems.
sinteractive supports, via the -T/--tunnel option, automatically creating SSH tunnels that can be used to access application servers you run within your job. See SSH Tunneling on Biowulf for details.
Re-connecting to interactive sessions: Interactive sessions are terminated if the controlling Biowulf session exits (e.g. laptop drops off the VPN). To maintain the interactive sessions even when you disconnect, we recommend tmux (tmux crash course, quick guide) or screen for text-based sessions. Start your sinteractive session from a tmux/screen window and then disconnect from the tmux/screen session before logging out. Then, when you reconnect to the Biowulf login node, you can re-attach to the tmux/screen session where your interactive session will be waiting for you. To reconnect to graphical sessions, use NX. Please do not run tmux or screen inside of NX.
Use the sacct command to check on the states of completed jobs.

Show all your jobs in any state since midnight:
    sacct
Show all jobs that failed since midnight:
    sacct --state f
Show all jobs that failed this month:
    sacct --state f --starttime 2015-07-01

Slurm Job Reasons
Slurm jobs will display reasons for either not running or ending prematurely, for example QOSMaxCpuPerUserLimit or ReqNodeNotAvail. An explanation of those reasons can be found here: _limits.html#reasons
This script tries to load a non-existent GATK version and then calls GATK. This will fail. However, bash by default keeps executing even if commands fail, so the script will eventually print 'DONE'. Since the exit status of a bash script is the exit status of its last command, and echo returns 0 (SUCCESS), the script as a whole will exit with an exit code of 0, signalling success, and the job state will show COMPLETED, since Slurm uses the exit code to judge whether a job completed successfully.
Similarly, if a command in the middle of the job script were killed for exceeding memory, the rest of the job script would still be executed and could potentially return an exit code of 0 (SUCCESS), resulting again in a state of COMPLETED.
A more selective approach involves carefully checking the exit codes of the important parts of a job script. This can be done with conventional if/else statements or with the conditional short-circuit evaluation often seen in scripts. For example:
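A minimal sketch of both styles; good_step and bad_step are stand-ins for the real commands in your job script (e.g. a module load or a GATK call):

```shell
#!/bin/bash
# Stand-ins for real commands (e.g. 'module load GATK/...' or the GATK call)
good_step() { echo "step ok"; }
bad_step()  { return 1; }          # simulates a failing command

# if/else style: check the exit code explicitly
if good_step; then
    echo "continuing"
else
    echo "good_step failed" >&2
    exit 1
fi

# short-circuit style: the right-hand side runs only if the left side fails
bad_step || echo "bad_step failed, would exit here" >&2
```

With either style, a failure in an important command is detected immediately instead of being masked by later commands that happen to succeed.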
Please note that the Slurm batch system may occasionally be shut down, either briefly for a necessary configuration change or for longer periods when system maintenance is underway. In these situations, a "downtime maintenance" script will be installed in place of the normal Slurm commands (sbatch, squeue, etc.). This downtime script will terminate with an exit code of 123, which provides an easy way for job scripts, workflows and/or pipelines to test whether the batch system is offline for maintenance. You can test for exit code 123 and, if found, know to try your request again later. For example:
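A sketch of such a check; here sbatch_stub simulates the downtime script's behavior (a real workflow would call sbatch itself and inspect its exit code):

```shell
#!/bin/bash
# sbatch_stub simulates sbatch while the downtime maintenance script is
# installed: it exits with code 123 instead of submitting the job.
sbatch_stub() { return 123; }

sbatch_stub job.sh
rc=$?
if [ "$rc" -eq 123 ]; then
    echo "batch system offline for maintenance; retrying later"
    # sleep 600   # in a real workflow: wait, then resubmit
fi
```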
Video: Utilizing local disk on Biowulf nodes (10 mins)

Each Biowulf node has some amount of local disk available for use. For most nodes this is generally 800GB of fast solid-state storage; a limited number have 2400GB (2.4TB). Use the freen command to see how much is available on each node type. For jobs that read/write lots of temporary files during the run, it may be advantageous to use the local disk as scratch or temp space.
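A sketch of the typical staging pattern (names are illustrative). In a real job you would submit with something like sbatch --gres=lscratch:50, and Slurm would create /lscratch/$SLURM_JOB_ID for you; here a temporary directory stands in for it so the pattern can be run anywhere:

```shell
#!/bin/bash
# Simulated stand-ins so the pattern is runnable outside the cluster:
lscratch=$(mktemp -d)        # stands in for /lscratch/$SLURM_JOB_ID
dest=$(mktemp -d)            # stands in for your /data directory

cd "$lscratch"
echo "intermediate data" > temp_output.txt   # heavy temp I/O stays on local disk
cp temp_output.txt "$dest/"                  # save results before the job ends
echo "results copied to $dest"
rm -rf "$lscratch" "$dest"
```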
Note that each subjob in a swarm will have a separate lscratch directory. That means that lscratch cannot be used to share data between subjobs. Commands bundled into a single subjob with the -b option to swarm will all share the same lscratch directory, however.
When the job is terminated, all data in the /lscratch/$SLURM_JOB_ID directory will be automatically deleted. Any data that needs to be saved should be copied to your /data directory before the job concludes.
Performance of lscratch will suffer for all users on a node when large numbers of files are created in a single directory. Please avoid these situations by either removing files no longer needed for the ongoing job, or structuring your data differently (subdirectories, an sqlite3 database, Python shelve, ...).
We provide a script, lscratch_mon, that can be used during development or debugging to monitor lscratch usage. It can be run in the foreground from a second shell on a compute node where a job is running, or added as a background process to a batch script. It has two modes: the default simple mode reports the total amount of storage used by a user on the node, including files from any jobs in /lscratch or /tmp; a detail mode distinguishes between the different locations at a slightly higher computational cost.
TMPDIR is a near-universally agreed-upon environment variable that defines where a program will write temporary files. By default, Unix systems set the value of TMPDIR to /tmp. On the Biowulf cluster, leaving TMPDIR set to /tmp can lead to problems due to:
Because of this, users are strongly encouraged to allocate local scratch disk for their jobs and to set TMPDIR to that local scratch disk. Because local scratch is not defined until the job begins running, setting TMPDIR must be done either within the batch script:
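For instance, a minimal sketch of the top of such a batch script (the lscratch allocation size is illustrative; a placeholder SLURM_JOB_ID lets the snippet run outside a job, where Slurm would normally set it):

```shell
#!/bin/bash
#SBATCH --gres=lscratch:20

# Slurm sets SLURM_JOB_ID in a real job; a placeholder is used here so the
# snippet runs anywhere.
SLURM_JOB_ID=${SLURM_JOB_ID:-12345}
export TMPDIR="/lscratch/$SLURM_JOB_ID"   # temp files now go to local scratch
echo "TMPDIR is $TMPDIR"
```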
The main difference between the TechDocs Basic and Recommended deployment approaches is where the docs are generated and stored. In the Basic, or out-of-the-box, setup, docs are generated and stored on the server running your Backstage instance. In the recommended setup, docs are generated on CI/CD and the generated sites are stored in external storage (e.g. AWS S3 or GCS); TechDocs in your Backstage instance then runs in read-only mode. Read more about the details and benefits in the TechDocs Architecture.
If TechDocs is configured to generate docs, it will first download source files based on the value of the backstage.io/techdocs-ref annotation defined in the Entity's catalog-info.yaml file. This is also called the Prepare step.
We strongly recommend that the backstage.io/techdocs-ref annotation in each documented catalog entity's catalog-info.yaml be set to dir:. in almost all situations. This is because TechDocs is aligned with the "docs like code" philosophy, whereby documentation should be authored and managed alongside the source code of the underlying software itself.
If, for example, you wanted to keep a lean root directory, you could place your mkdocs.yml file in a subdirectory and update the backstage.io/techdocs-ref annotation value accordingly, e.g. to dir:./sub-folder:
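A hypothetical catalog-info.yaml illustrating this (entity name, type, and owner are placeholders):

```yaml
# catalog-info.yaml at the repository root; mkdocs.yml lives in ./sub-folder
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: example-component          # placeholder name
  annotations:
    backstage.io/techdocs-ref: dir:./sub-folder
spec:
  type: service                    # placeholder spec values
  owner: team-a
  lifecycle: production
```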
In rare situations where your TechDocs source content is managed and stored in a location completely separate from your catalog-info.yaml, you can instead specify a URL location reference, the exact value of which will vary based on the source code hosting provider. Notice that the url: prefix is used instead of the dir: prefix. For example:
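A hypothetical annotation using a URL reference (the repository URL is a placeholder; the exact format depends on your hosting provider):

```yaml
metadata:
  annotations:
    backstage.io/techdocs-ref: url:https://github.com/example-org/example-docs-repo
```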
Note that, just as it's possible to specify a subdirectory with the dir: prefix, you can also provide a path to a non-root directory inside the repository which contains the mkdocs.yml file and docs/ directory.
The URL Reader uses the source code hosting provider to download a zip or tarball of the repository. The archive has no git history attached and is compressed, so the file size is significantly smaller than the amount of data a git clone would have to transfer.
TechDocs uses a composability pattern similar to the Search and Catalog plugins in Backstage. While a default table experience, similar to the one provided by the Catalog plugin, is made available for ease of use, it's possible for you to provide a completely custom experience, tailored to the needs of your organization. For example, TechDocs comes with an alternative grid-based layout ( ) and panel layout (TechDocsCustomHome).
But you can replace it with any React component, which will be rendered in its place. Most likely, you would want to create and maintain such a component in a new directory at packages/app/src/components/techdocs, and import and use it in App.tsx:
Similar to how it is possible to customize the TechDocs Home, it is also possible to customize the TechDocs Reader Page. This is done in your app package. By default, you might see something like this in your App.tsx:
This guide only applies to the "recommended" TechDocs deployment method (wherean external storage provider and external CI/CD is used). If you use the"basic" or "out-of-the-box" setup, you can stop here! No action needed.
The beta version of TechDocs made a breaking change to the way TechDocs content was accessed and stored, allowing pages to be accessed with case-insensitive entity triplet paths (e.g. /docs/namespace/kind/name, whereas in prior versions they could only be accessed at /docs/namespace/Kind/name). In order to enable this change, documentation has to be stored in an external storage provider using an object key whose entity triplet is lower-cased.
New installations of TechDocs since the beta version will work fine with no action, but for those who were running TechDocs prior to this version, a migration will need to be performed so that all existing content in your storage bucket matches this lower-case entity triplet expectation.
Ensure you have the right permissions on your storage provider: In order to migrate files in your storage provider, the techdocs-cli needs to be able to read/copy/rename/move/delete files. The exact instructions vary by storage provider, but check the using cloud storage page for details.