MLSC Cluster Etiquette

The MLSC compute cluster is a resource shared among several groups. Please adhere to the following recommendations so that we can all use it effectively and happily.

Understand Slurm

Read the MLSC guidelines so you know how to tell Slurm what resources your job needs.
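
As a starting point, a batch script can spell out exactly what the job needs. This is only a sketch: the partition name and the script `train.py` are placeholders, and the real partition and GPU (GRES) names are given in the MLSC guidelines.

```bash
#!/bin/bash
#SBATCH --job-name=train-model       # a short, descriptive name
#SBATCH --partition=<gpu-partition>  # placeholder: use the partition from the MLSC guidelines
#SBATCH --gres=gpu:1                 # number of GPUs the job actually needs
#SBATCH --cpus-per-task=4            # CPUs for data loading and preprocessing
#SBATCH --mem=32G                    # RAM for the job
#SBATCH --time=12:00:00              # a realistic wall-time limit helps the scheduler

srun python train.py                 # train.py stands in for your own program
```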

Adjust your usage to cluster load

Adjust your usage dynamically to the cluster’s workload. Feel free to request many GPUs when the cluster is idle, but limit yourself to about 6-8 GPUs during peak times. Do not block resources that you don’t need; for example, don’t request 80-GB GPUs if you only need 40 GB. If the cluster gets busy while you are holding many resources, end the jobs that are not crucial. Do not tie up substantial resources during periods when you cannot check your email.
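
For example, you can check which GPU types each partition offers and then request the smaller card explicitly. The GRES type name `a100-40gb` and the script `job.sh` below are placeholders; use whatever names `sinfo` reports on MLSC.

```bash
# Show which GPU types (GRES) each partition offers
sinfo -o "%P %G"

# Request only the GPU type you actually need ("a100-40gb" is a placeholder)
sbatch --gres=gpu:a100-40gb:1 job.sh
```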

Also, use job arrays to limit the number of jobs you have running at once when `squeue` shows the partitions are in high demand, as in the sketch below.
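
A throttled job array keeps a large sweep polite: all the work is queued, but Slurm runs only a few tasks at a time. The script `sweep.py` and its `--config-index` option are placeholders for your own code.

```bash
#!/bin/bash
#SBATCH --job-name=sweep
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --array=0-99%8   # 100 array tasks, but at most 8 running at the same time

# SLURM_ARRAY_TASK_ID selects which configuration this task runs
srun python sweep.py --config-index "$SLURM_ARRAY_TASK_ID"
```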

Understand how CPU allocation blocks GPUs

Each GPU node has a limited number of CPUs and a limited amount of RAM. In general, request resources at or below the node’s per-GPU ratio. For example, if a node has 32 CPUs for 8 GPUs, request at most 4 CPUs per GPU. If too many jobs exceed this ratio and all CPUs have been allocated, nobody can use the remaining GPUs. Likewise, submit CPU-only jobs to the basic partition. If you have good reason to deviate from this guideline, adjust your usage to cluster load.
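
For the example node above (32 CPUs for 8 GPUs), a two-GPU job should request at most 8 CPUs, and RAM in roughly the same proportion. The numbers here are illustrative; you can check a node’s real configuration with `scontrol show node <nodename>`.

```bash
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=8   # 2 GPUs x 4 CPUs per GPU
#SBATCH --mem=64G           # keep RAM proportional to your share of the node's GPUs
```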

Interactive jobs

End interactive jobs and free their resources as soon as you no longer need them. You can save and restore MATLAB and Jupyter sessions.
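
It is easy to forget an idle interactive session, so list your jobs occasionally and cancel anything you are no longer using (`<jobid>` below is the ID shown by `squeue`).

```bash
# List your own jobs, including interactive ones you may have forgotten
squeue -u "$USER"

# End a job you no longer need
scancel <jobid>
```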

Optimize your resource use

Optimize your code to use the GPU effectively and make sure your job actually uses the CPUs you requested. Jobs often spend a long time loading or preprocessing data on the CPU and very little time actually using the GPU. You can almost always optimize such code to speed up your jobs dramatically, helping your research progress while releasing resources for others.
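
One easy win is to size worker and thread pools from the allocation instead of hard-coding them, so the job actually uses the CPUs it reserved. The script `train.py` and its `--num-workers` option are placeholders for your own code.

```bash
# Inside your batch script: match thread/worker counts to the allocated CPUs
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
srun python train.py --num-workers "${SLURM_CPUS_PER_TASK:-1}"
```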

Monitor your jobs and the cluster

Periodically check on your jobs. GPU utilization might drop after some time, which could indicate a problem. Perhaps you launched many jobs when the cluster was empty, but it has since become busy. You can use the Job Monitor, the GPU Monitor, or `ssh` into a job’s node and run `top` and `nvidia-smi`. After a job completes, check how well it used the requested resources with `jobinfo`.
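
A typical check looks something like the following; `<nodename>` and `<jobid>` come from the `squeue` output, and the exact options of the site-provided `jobinfo` tool may differ.

```bash
# See which node each of your running jobs landed on
squeue -u "$USER" -o "%i %j %T %N"

# Inspect that node directly
ssh <nodename>
top          # CPU usage of your processes
nvidia-smi   # GPU utilization and memory

# After the job finishes, review how well it used what it requested
jobinfo <jobid>
```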

Communicate, graciously

Email batch-users@nmr.mgh.harvard.edu if you have problems or questions. Let people know if you need more resources than usual or urgently need them at busy times. Sometimes users accidentally request too many resources, block GPUs, or forget about an interactive job; it is reasonable to email them individually and politely ask if they are aware. Conversely, don’t wait for others to ask you directly how long your jobs will take or whether you really need the 80-GB GPUs; be proactive about ending jobs you don’t need.

Last Updated on April 11, 2024 by Paul Raines