Using the MLSC Compute Cluster

Overview

The MLSC Compute Cluster is a mix of GPU and non-GPU compute nodes and high-performance storage funded by the Massachusetts Life Sciences Center to support research at the Athinoula A. Martinos Center for Biomedical Imaging and the CTRU/IBC. The cluster provides 1200 CPU cores, 16 RTX6000 GPUs, 44 RTX8000 GPUs and 32 A100 GPUs along with 1 petabyte of flash-based high-performance storage.

Getting Started

Individual research labs and/or projects need their own “slurm group account” to assign jobs to when submitting.  A lab can have one slurm group account for the whole lab or multiple accounts for different grants/projects (very much like scanner projects for users of the Martinos Center MRI bays).

The lab PI should email help@nmr.mgh.harvard.edu with the preferred account name (8 characters or less – most common is to use your lab's most general UNIX group name if you have one), background about your research project(s), and the list of users who will be submitting jobs for this project. Each user needs a “full login” Martinos Linux user account if they do not already have one.  Full login accounts have a fee of $16.67/month.

Once your account/project is approved, you can ssh to the MLSC cluster master node:

ssh -l username mlsc.nmr.mgh.harvard.edu

The -l username can be omitted if you are connecting from another Martinos Center Linux box. After you log in, verify that you are in the SLURM system by running

sshare -U -u $USER

and verifying your name is listed under at least one slurm account.
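
For a more compact view you can also select just the fields of interest with the -o option (the same fields appear in the Fairshare section below), for example:

sshare -U -u $USER -o Account,User,RawShares,RawUsage,FairShare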

Read all of the information below about how to use the cluster and follow the MLSC Cluster Etiquette guidelines.

Please cite the Massachusetts Life Sciences Center on your posters and publications for any data analyzed on this cluster.

SLURM Terminology

  • SLURM group account – a group of users working on the same project(s)
    • A lab can have multiple group accounts if a need is demonstrated
    • The Fairshare priority system is currently applied to these groups, not to individual users
  • job – a submitted analysis that defines the resource allocations it needs
  • partition – a group of nodes with a certain set of resource limits and defaults
    • the equivalent of queues in PBS/Torque (in SLURM, the queue is just the job submission queue)
  • node – one of the physical servers (A100 DGX, EXXACT RTX, Dell R440)
  • resource – memory, time, GPUs, cores, etc. required by a job
  • task – a division of a job with multiple processes (e.g. MPI); typically just one per job
    • separate programs/processes you run at the same time in the same job
  • cpu/core – in SLURM a “CPU” is a core (physical CPU chips are called sockets); Hyperthreading is off
    • the number of cores usable by multi-threading-capable programs like MATLAB
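
As a rough sketch, these terms map onto sbatch options as follows (the account name mylab is a placeholder; complete examples appear in the sections below):

#SBATCH --account=mylab        # SLURM group account (placeholder name)
#SBATCH --partition=rtx8000    # partition: which group of nodes to submit to
#SBATCH --nodes=1              # node: number of physical servers
#SBATCH --ntasks-per-node=1    # task: usually just one per job
#SBATCH --cpus-per-task=3      # cpu/core: cores for a multi-threaded program
#SBATCH --mem=128G             # resource: memory
#SBATCH --gpus=1               # resource: GPU
#SBATCH --time=0-04:00:00      # resource: wall time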

Computational Resources

The equipment bought on the MLSC grant consists of:

  • four NVIDIA DGX A100 servers, each with eight NVIDIA A100 GPUs, two AMD EPYC 7742 64-core CPUs, 1TB of memory, and ten 100GbE network connections
  • five EXXACT servers, each with ten NVIDIA RTX8000 GPUs, two Intel Xeon Gold 6226R 16-core CPUs, 1.5TB of memory, and one 100GbE network connection
  • two EXXACT servers, each with eight NVIDIA RTX6000 GPUs, two Intel Xeon Gold 5218 16-core CPUs, 1.5TB of memory, and one 100GbE network connection
  • thirty-two Dell R440 servers, each with two Intel Xeon Silver 4214R 12-core CPUs, 384GB of memory, and one 25GbE network connection (two are used for command/login)
  • a 1.35PB VAST all-flash storage appliance with eight 100GbE network connections
  • four 100GbE switches with over 100 total ports, uplinked to the MGB corporate network via a 100GbE uplink
  • three full-sized 42U racks with the necessary PDUs for power, cables and a floating KVM

The DGX systems are running Ubuntu 18.04 while all other systems are running CentOS 8.

When submitting jobs on the SLURM cluster, use the table below to identify which partition to use and to plan how to ask for CPU, GPU, and memory resources.

Partition   #Nodes   #Cores   RAM                  #GPU                         Model                 Scratch
basic       30       24       373GB (15GB/core)    none                         Dell PowerEdge R440   1.5TB
rtx6000     2        32       1.5TB (45GB/core)    8 RTX 6000 (4 cores/GPU)     RTX 6000 GPU Server   7.0TB
rtx8000     5        32       1.5TB (45GB/core)    10 RTX 8000 (~3 cores/GPU)   RTX 8000 GPU Server   7.0TB
dgx-a100    4        128      1.0TB (7GB/core)     8 A100 (16 cores/GPU)        NVIDIA DGX A100       14TB

What are called queues in the launchpad Torque/PBS system are called partitions in SLURM: a group of nodes (and thus resources) that you submit to. In SLURM, a queue refers only to the job submission queue.

The default time given to a job is just 4 hours unless you ask for more, up to a maximum of 7 days. The --time format used in the examples below is days-hours:minutes:seconds.

The default memory for a job if not specified is just 10MB per core requested.

A “node” is the physical server box itself. If a box has 10 CPUs and you ask for 2 tasks with 5 CPUs each, you may get all 10 CPUs on one box, or 5 CPUs on one box and 5 CPUs on another. CPUs in SLURM are just cores, as we have disabled HyperThreading to follow HPC best practice, so there is only one thread per core.
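
For example, a minimal sketch (the account name mylab and the script my_analysis.sh are placeholders) that overrides the 4-hour and 10MB-per-core defaults on the basic partition:

sbatch --partition=basic --account=mylab --cpus-per-task=4 \
  --mem=60G --time=2-00:00:00 my_analysis.sh

Here 60G matches 4 cores at the roughly 15GB per core available on the basic nodes, and 2-00:00:00 requests 2 days.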

Submitting batch jobs in SLURM

The generic command to submit jobs in SLURM is ‘sbatch’. However, there is a custom job submission wrapper script called jobsubmit that you can use to make simple submissions. Run

/usr/local/bin/jobsubmit -h

for instructions on how to use it. NOTE: options for jobsubmit are not the same as for sbatch/srun.

Specifying memory and time limits is mandatory in order to force users to think about their job resources more carefully. With this in mind, note that on the RTX GPU boxes there are on average ~3 CPU cores and 128GB of RAM per GPU. On the Dell non-GPU nodes there are about 15GB of RAM per CPU core.

It is also mandatory to specify the partition (aka queue) and SLURM account (project/lab). You do not need to put the command you submit in quotes; just make sure the command and all its arguments come after all the jobsubmit options. Note also that by default no emails are sent unless the -M option is given. Example:

jobsubmit -p rtx8000 -A sysadm -m 128G -t 1-12:00:00 -c 3 -G 1 \
-M END,FAIL python3 my_gpu_job.py -d /cluster/data/subject --epochs 100

This jobsubmit script will actually create an sbatch submission script in the directory /cluster/batch/username with the name sjob_subnumber.

Your job’s terminal standard output and error will be written to a file in that same directory named sjob_subnumber.outjobnumber. There are options to split them into two different files. Files older than 15 days are removed from /cluster/batch nightly, so make sure to copy the output elsewhere when the job is done if you need to keep it.

If you choose to use sbatch instead of jobsubmit you must write a script to submit. You can embed submission options in the script header as well as give options on the command line. Example:

=========================testjob.sh==========================
#!/bin/bash
#SBATCH --account=sysadm
#SBATCH --partition=basic
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=6
#SBATCH --mem=2G
#SBATCH --time=0-01:15:00

cd /autofs/cluster/batch/raines/john-1.9.0-jumbo-1/run/
./john --test=300 --format=md5crypt
=============================================================

sbatch --job-name=john001 --output=john001_%j.out \
--mail-type=END,FAIL testjob.sh

NOTE:  the options to sbatch (and srun) are not the same as the jobsubmit wrapper. See the man pages for sbatch and srun for details.

You can submit a GPU job to both rtx6000 and rtx8000 partitions by specifying:

-p rtx6000,rtx8000

and it will take the first available.
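
For example, the earlier jobsubmit command could be directed to whichever RTX partition has room first:

jobsubmit -p rtx6000,rtx8000 -A sysadm -m 128G -t 1-12:00:00 -c 3 -G 1 \
-M END,FAIL python3 my_gpu_job.py -d /cluster/data/subject --epochs 100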

Interactive Jobs

To run an interactive job you need to use the ‘srun’ command. You should stick to single-node jobs when doing this (i.e. -N 1). The example below gets 1 RTX8000 GPU for 1 day 10 hours with 3 CPU cores and 256GB of RAM.

srun -p rtx8000 -A sysadm -N 1 --ntasks-per-node=1 --gpus=1 --mem=256G \
  --time=1-10:00:00 --cpus-per-task=3 --pty /bin/bash

If you want to be able to run X11 apps, add --x11=first to the options.

It is important for users to actually use the shells on the nodes they get this way and not leave them idle overnight. Fairshare weighting is applied to actual clock time used, not CPU time, so idle time affects your future job priority just as much as time spent using 100% of the CPU or GPU.

Job Arrays

If you have multiple jobs to run using the same command with slight changes in the input arguments, you should use job arrays. Job arrays also let you limit the number of jobs that run at once (for instance, to avoid overloading the desktop workstation the data is being pulled from, or to avoid hogging all the nodes if you have high priority). Your sbatch job script would look something like this:

=========================testjob.sh==========================
#!/bin/bash
#SBATCH --account=sysadm
#SBATCH --partition=rtx8000
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=32G
#SBATCH --gpus=1
#SBATCH --mail-type=FAIL
#SBATCH --time=0-01:15:00
#SBATCH --output=slurm-%A_%a.out  # each job will output to separate file in current dir
#SBATCH --array=1-50%5  # Submit 50 tasks with only 5 of them running at any one time

export SUBJECTS_DIR=/vast/itgroup/subjects
source /usr/local/freesurfer/nmr-stable6-env-bash

#contains the list of 50 subjects to run on
subj_file=$SUBJECTS_DIR/listof50subjs.txt

# Get "SLURM_ARRAY_TASK_ID"th subject from file
subj=$(sed -n "${SLURM_ARRAY_TASK_ID}p" $subj_file)

# Run the command on selected subject
fc-cuda-analyze -s $subj -x -c $SUBJECTS_DIR/input.cfg

=============================================================

NOTE: using #SBATCH --array=10-20%5 will run only on the 10th through 20th lines (11 jobs in total) of the subject file
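
Individual array tasks can be managed with the usual SLURM tools using the jobid_taskid form (the job ID 1234567 below is hypothetical):

squeue -j 1234567      # list the tasks of array job 1234567
scancel 1234567_7      # cancel only task 7 of the array
scancel 1234567        # cancel the whole array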

A more detailed overview of job arrays can be found on this UFL page.

SCRATCH space

Each node has a local /scratch directory that should be used for temporary files instead of /tmp.  There is also the network directory /vast/scratch shared by all nodes.

Files in both the local and VAST scratch directories are removed once they are 7 days old, but we will soon implement a scheme where files are removed after 3 days unless they are in a directory with the same name as a job running on the node (e.g. /scratch/29770).

The TMPDIR environment variable is set to /scratch/$SLURM_JOBID in the default environment of each job. This directory is automatically created via a prolog wrapper before the user part of the job starts and is automatically removed when the job finishes.
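
A sketch of how a job script might use this per-job scratch space (the analysis command and the /cluster/mylab paths are hypothetical):

cd $TMPDIR                                    # /scratch/$SLURM_JOBID on the compute node
cp /cluster/mylab/data/subject01.nii.gz .     # stage input onto fast local disk
my_analysis subject01.nii.gz -o result.nii.gz
cp result.nii.gz /cluster/mylab/results/      # copy results back before the job ends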

MATLAB usage

We have a limited number of MATLAB licenses, and running many MATLAB jobs that use these licenses will oversubscribe our license server. For that reason users should run at most 5 jobs that use a MATLAB license at any one time and should submit those jobs with the --license=matlab:1 option.
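
A sketch of the relevant sbatch lines for a licensed MATLAB job (the account name mylab and the script my_script.m are placeholders; the license option is spelled as in the note above):

#SBATCH --account=mylab
#SBATCH --partition=basic
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=0-08:00:00
#SBATCH --license=matlab:1     # reserve one MATLAB license slot

matlab -nodisplay -nosplash -r "my_script; exit"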

You can still run more than five MATLAB jobs if you compile your code ahead of time, as compiled MATLAB code does not use a license. This is done using the deploytool.
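
As a rough sketch, the command-line equivalent of deploytool is mcc (the script name and the MATLAB install path here are placeholders):

mcc -m my_script.m                       # produces a standalone my_script plus run_my_script.sh
./run_my_script.sh /usr/local/matlab     # the wrapper takes the MATLAB/runtime root as its first argument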

Singularity container jobs

The cluster does not support Docker but does support running Docker containers converted to Singularity images. The best performance on the GPUs can almost always be obtained by using the optimized containers directly from NVIDIA. For more information see our Using Singularity on the MLSC Cluster page.

SLURM tools

  • squeue to see what is queued/running
  • sinfo for general cluster state
  • sbatch, srun or jobsubmit (as above) to submit
  • sshare to see your slurm account fairshare stats
  • scontrol show job=<job#> for more details while a job is running
  • scancel to cancel jobs
  • sattach to view stdout/stderr of running jobs
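
A few typical invocations (the job ID 1234567 is hypothetical):

squeue -u $USER               # just your queued/running jobs
sinfo -p rtx8000              # state of the rtx8000 partition nodes
scontrol show job=1234567     # detailed info while the job is running
sattach 1234567.0             # attach to step 0's output (if the job started one with srun)
scancel 1234567               # cancel the job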

Use the man command to see more help on each command. There are PBS wrapper commands (qstat, qsub, qdel), but they support only limited options and their output is often a bit different.

For finished jobs, a good way to see stats is like this:

sacct --starttime 2020-09-08 -o "JobID,Elapsed,CPUTime,MaxRSS,TotalCPU,State"

For more generic details about SLURM read the docs at https://slurm.schedmd.com/

Custom tools

I have created a few custom tools that are wrappers around the above commands to make getting certain information easier:

  • jobinfo <jobnum#>: give detailed info on running/completed jobs such as resources used
  • showpending: show jobs pending in the queue in priority order
  • shownodes: show all nodes with their current resources allocated to see how busy things are and where resources may still be available
  • showrunning: show all running jobs listing node, time elapsed and resource reserved
  • showrunusage: show running jobs listing node, time elapsed with cpu time consumed, GPU utilization average, GPU max mem

A web page that shows a running job monitor that you can filter to just your jobs is at https://www.nmr.mgh.harvard.edu/martinos/secure/mlsc/jobmon.php

Fairshare priority system

All jobs must be submitted under an account; accounts are based on labs and/or projects. Users cannot use SLURM unless they are assigned to at least one account.

Each SLURM group account gets the same 200 “raw” shares, which is basically an arbitrary number big enough to give some granularity. What matters is the relative value compared to other groups, but since right now everyone has the same number, it is immaterial.

Job priority in the submission queue is based on the group account’s share weighted by a usage factor computed from the resources the group has used. Each resource (GPU, CPU, memory) has a weight, with “better” resources having more weight than slower ones (i.e. an A100 GPU has more weight than an RTX8000, which has more than an RTX6000). The older the usage, the less its weight, with a half-life factor of 2 days (a job you ran 2 days ago will “weight down” your fairshare half as much as an equivalent job you just ran today, and a job run 4 days ago a quarter as much). The half-life parameter is one we will likely change often to tune the system.

Once queued, the priority of your job will go up as it ages. But only the top 8 jobs your SLURM group has in the queue will gain the age-based priority.

The final fairshare factor for a SLURM group account is normalized to be between 1.0 (the account has run no jobs recently, so has the highest priority) and 0 (no share left; the account will only have jobs run when nodes are free and no accounts with a non-zero share have jobs waiting).

One can use the sshare command to monitor the Fairshare system. For example, the results below were obtained using the following command on the MLSC login node.

# sshare -o Account,RawShares,RawUsage,FairShare                           
             Account  RawShares    RawUsage  FairShare
-------------------- ---------- ----------- ----------
root                              703742066   0.500000
 root                         1           0   1.000000
 arnoldgp                   200           0   1.000000
 bandlab                    200           0   1.000000
 brainbes                   200     3432399   0.950533
 carplab                    200           0   1.000000
 fsdev                      200   135319956   0.135358
 imanlab                    200         107   0.999998
 lcn                        200   334070210   0.007172
 lcnrtx                     200   136947934   0.132108
 lfilab                     200           0   1.000000
 purdongp                   200        3653   0.999946
 qtim                       200     4190995   0.939941
 roffmagp                   200    89395433   0.266923
 sysadm                     200      381374   0.994379
 visuo                      200           0   1.000000
 zolleigp                   200           0   1.000000

RawUsage is the group’s usage weighting factor with the 28-day half-life factor applied to all jobs that group has run, according to the resource weights of each job.

FairShare is the normalized fairshare weight percentage applied to the priority of new jobs submitted for that account.

Use a command like the following to see an accounting of the usage of your group per user.

# sshare --account=lcn -a -o Account,User,RawShares,RawUsage,FairShare
         Account       User  RawShares    RawUsage  FairShare
---------------- ---------- ---------- ----------- ----------
lcn                                200   334067236   0.007168
 lcn                  avd12     parent     5983822   0.007168
 lcn               cmagnain     parent      578823   0.007168
 lcn                 fischl     parent   171850255   0.007168
 lcn               iglesias     parent           0   0.007168
 lcn                   mu40     parent   155654334   0.007168

A more detailed description of the FairShare system can be found in the Harvard FAS SLURM cluster documentation.

VAST Storage

Volumes on the VAST storage system are available with a 10TB limit per lab. They should be used only for active projects using the MLSC cluster, staging data to and from other storage as needed. The price is the same $26.67/TB/month as our primary Dell storage volumes.

This storage is ultra-high performance only over NFS to the cluster nodes, but it will still give better performance over NFS than Dell storage to non-cluster Martinos Linux systems. SMB/CIFS access to VAST via a Dell gateway proxy server is available if needed. Each volume gets a weekly mirror backup to Dell storage.
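
For example, a typical staging workflow might look like this (the lab volume paths are hypothetical):

rsync -a /cluster/mylab/project1/ /vast/mylab/project1/     # stage input data onto VAST
# ... run MLSC jobs that read and write under /vast/mylab/project1 ...
rsync -a /vast/mylab/project1/derived/ /cluster/mylab/project1/derived/   # copy results back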

Last Updated on February 7, 2024 by Paul Raines