The MLSC compute cluster is a resource shared among several groups. Please adhere to the following recommendations so that we can all use it effectively and happily.
Read the MLSC guideline so that you know how to tell Slurm what resources your job needs.
Adjust your usage to cluster load
Dynamically adjust your usage to the cluster’s workload. Request GPUs freely when the cluster is idle, but limit yourself to about 6-8 GPUs during peak times. Do not block resources you don’t need; for example, don’t request 80-GB GPUs if 40 GB suffices. If the cluster gets busy while you are using many resources, end any jobs that are not crucial. Do not use substantial resources at times when you are unable to check your email.
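As a sketch, a submission that requests only what the job needs might look like the following; the partition name, time limit, and script name are assumptions, so check the MLSC documentation for the actual values.

```shell
#!/bin/bash
# Hypothetical job script: request only what the job actually needs.
# Partition name and time limit are assumptions, not site-specific values.
#SBATCH --partition=gpu        # assumed name of a GPU partition
#SBATCH --gres=gpu:1           # one GPU, not several "just in case"
#SBATCH --mem=40G              # sized to the job, not to the whole node
#SBATCH --time=08:00:00        # a realistic limit helps the scheduler plan

python train.py                # hypothetical workload
```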
Understand how CPU allocation blocks GPUs
Each GPU node has a finite number of CPUs and a finite amount of RAM. Generally, request resources at or below the node’s per-GPU ratio. For example, if a node has 32 CPUs for 8 GPUs, request at most 4 CPUs per GPU. If too many jobs exceed this ratio and all CPUs are allocated, the remaining GPUs become unusable because no CPUs are left to drive them. Generally, submit CPU-only jobs to the basic partition. If you have good reason to deviate from this guideline, adjust your usage to cluster load.
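A job on the example node above (32 CPUs for 8 GPUs) that stays at the 4-CPUs-per-GPU ratio could be sketched as follows; the memory value and script name are assumptions.

```shell
#!/bin/bash
# Hypothetical GPU job on a node with 32 CPUs and 8 GPUs (ratio: 4 CPUs per GPU).
#SBATCH --gres=gpu:2           # two GPUs
#SBATCH --cpus-per-task=8      # 2 GPUs x 4 CPUs/GPU, exactly at the node's ratio
#SBATCH --mem=64G              # assumed per-GPU memory share; adjust to the node

python train.py                # hypothetical workload
```

A CPU-only job would instead go to the basic partition, e.g. `sbatch --partition=basic job.sh`.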
Optimize your resource use
Optimize your code to use the GPU effectively, and make sure your job actually uses the CPUs you requested. Jobs often spend a long time loading or preprocessing data on the CPU and very little time actually using the GPU. You can almost always optimize such jobs to run dramatically faster, helping your research progress while freeing resources for others.
Monitor your jobs and the cluster
Periodically check on your jobs. GPU utilization can drop after some time, which may indicate a problem. Perhaps you launched many jobs when the cluster was empty, but now it is busy. You can use the Job Monitor, the GPU Monitor, or `ssh` into a job’s node to run `top` and `nvidia-smi`.
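The manual check can be sketched as the following commands; the node name is only an example, not a real host.

```shell
squeue -u "$USER"        # list your jobs and the nodes they are running on
ssh gpunode-07           # hop onto a job's node (hypothetical node name)
top -u "$USER"           # are the CPUs you requested actually busy?
nvidia-smi               # is GPU utilization where you expect it to be?
```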
Email firstname.lastname@example.org if you have problems or questions. Let people know if you need more resources than usual or urgently need them at busy times. Sometimes users accidentally request too many resources, block GPUs, or forget about an interactive job; it is reasonable to email them individually and politely ask whether they are aware. Conversely, if someone asks how long your jobs will take or whether you really need the 80-GB GPUs, don’t make them chase you: respond promptly and be proactive about ending jobs.
Last Updated on February 5, 2024 by Paul Raines