Typhon V7


Demetrius Dade

Aug 3, 2024, 6:09:31 PM
to omjochamhy

Overview
For large parallel computations and batch jobs, IAS has a 64-node Beowulf cluster named Typhon. Each node has quad 24-core 64-bit Intel Cascade Lake processors, providing a total of 6144 processor cores. Each node has 384 GB RAM (4 GB/core). For low-latency message passing, all nodes are interconnected with HDR100 InfiniBand.
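
For jobs that span nodes, message passing is typically done with MPI. The sketch below is only an illustration; it assumes mpi4py is installed in your Python environment (this announcement does not confirm that) and simply has each rank report where it is running.

# mpi_hello.py -- minimal MPI sketch, assuming mpi4py is available.
# Launch with your site's MPI launcher, for example something like:
#   mpirun -np 96 python mpi_hello.py
from mpi4py import MPI

comm = MPI.COMM_WORLD            # communicator covering every rank in the job
rank = comm.Get_rank()           # this process's rank (0 .. size-1)
size = comm.Get_size()           # total number of ranks

# Each rank reports its host; inter-node traffic rides the HDR100 InfiniBand fabric.
print(f"rank {rank} of {size} on {MPI.Get_processor_name()}")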

All nodes mount the same /home and /data filesystems as the other computers in SNS. Scratch space locations have been adjusted to make it easier to tell local and network resources apart: /scratch/lustre is the new mount point for the parallel file system, and /scratch/local/ is for node-local storage.

Login Nodes
The primary login nodes, typhon-login1 and typhon-login2, should be used for interactive work such as compiling programs and submitting jobs. Please remember that these are shared resources for all users.

The /data, /home, and /scratch file systems are available on all login and cluster nodes.

All nodes have access to our parallel filesystem through /scratch/lustre.
600 GB of local scratch is available on each node in /scratch/local/.
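
When a job needs scratch space, the choice is between the shared Lustre scratch (visible from every node) and the node-local scratch (only on that node). The sketch below shows one way to pick between them; the per-user subdirectory layout is an assumption for illustration, not a documented convention.

# scratch_paths.py -- sketch of choosing a scratch location, using the mount
# points named in this announcement. The per-user subdirectory is hypothetical.
import os
import getpass

LUSTRE_SCRATCH = "/scratch/lustre"   # shared parallel filesystem, visible on all nodes
LOCAL_SCRATCH = "/scratch/local"     # ~600 GB of node-local storage on each node

def scratch_dir(shared: bool = True) -> str:
    """Return a per-user scratch directory, creating it if needed."""
    base = LUSTRE_SCRATCH if shared else LOCAL_SCRATCH
    path = os.path.join(base, getpass.getuser())   # hypothetical per-user layout
    os.makedirs(path, exist_ok=True)
    return path

if __name__ == "__main__":
    # Use shared scratch for data other nodes must read; use local scratch for
    # temporary files that only this node needs.
    print("shared:", scratch_dir(shared=True))
    print("local :", scratch_dir(shared=False))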

Job Scheduling
The cluster determines job scheduling and priority using Fair Share, a per-user score based on past usage: the more you have run recently, the lower your score (and thus your priority) temporarily becomes.

The current maximum allowed wall time is 168 hours (7 days). Users who need to run for longer than this window should build restart-file (checkpoint) support into their jobs so that they can comply with the limit.
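
A common restart-file pattern is to write a checkpoint periodically and resume from it when the job is resubmitted. The Python sketch below illustrates the idea only; the file name, checkpoint interval, and state layout are illustrative assumptions.

# checkpoint_restart.py -- sketch of a restart-file pattern for staying within
# the 168-hour limit. Checkpoint path, interval, and state layout are examples.
import json
import os

CHECKPOINT = "checkpoint.json"     # hypothetical restart file on shared storage
CHECKPOINT_EVERY = 1000            # steps between checkpoints (assumption)
TOTAL_STEPS = 100000

def load_state() -> dict:
    """Resume from the last checkpoint if one exists, otherwise start fresh."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"step": 0, "accumulator": 0.0}

def save_state(state: dict) -> None:
    """Write the checkpoint atomically so a killed job leaves a usable file."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT)

state = load_state()
for step in range(state["step"], TOTAL_STEPS):
    state["accumulator"] += step * 1e-6   # stand-in for real work
    state["step"] = step + 1
    if state["step"] % CHECKPOINT_EVERY == 0:
        save_state(state)

save_state(state)
print("finished at step", state["step"])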
