New MLSC Compute Cluster

OVERVIEW

In the fall of 2020, the Martinos Center will bring online a new compute cluster funded by a grant from the Massachusetts Life Sciences Center (MLSC).  This cluster will have CPU resources a decade newer than our old Launchpad cluster as well as significant GPU resources.

  • four NVIDIA DGX A100 servers with eight NVIDIA A100 GPUs, two AMD EPYC 7742 64-core CPUs, 1TB of memory, and ten 100GbE network connections
  • five EXXACT servers with ten NVIDIA RTX8000 GPUs, two Intel Xeon Gold 6226R 16-core CPUs, 1.5TB of memory, and one 100GbE network connection
  • one EXXACT server with two Intel Xeon Gold 6226R 16-core CPUs, 1.5TB of memory, and one 100GbE network connection
  • two EXXACT servers with eight NVIDIA RTX6000 GPUs, two Intel Xeon Gold 5218 16-core CPUs, 1.5TB of memory, and one 100GbE network connection
  • thirty-two DELL R440 servers with two Intel Xeon Silver 4214R 12-core CPUs, 384GB of memory, and one 25GbE network connection
  • a 1.35PB VAST Storage appliance with eight 100GbE network connections

STATUS

All equipment is currently racked at the Marlborough Data Center except for a critical 64-port 100GbE network switch needed to enable high-speed networking between the A100 and EXXACT GPU nodes and to uplink directly to the Partners corporate network.  The shipping estimate still shows early October.  Until then we are using a private network behind our storage server at MDC.  We are looking into temporary alternatives for making the direct network connections that will be needed to support production loads.

We have the DELL and EXXACT systems fully configured for the Martinos Linux environment running CentOS8.  An initial bare-bones SLURM install has been done on the DELL nodes, with one volunteer testing.  We also have volunteers testing the CentOS8 install interactively on the EXXACT systems to find any gotchas arising from the move from CentOS7.

The A100s are running Ubuntu 18.04 instead of CentOS8, as required by NVIDIA.  We are mostly done integrating this into the Martinos Linux domain and have run some initial tests.  It is clear that in many cases analysis programs will have to be separately installed and compiled for use on the A100s; one cannot just run the same binaries on both.  In order to keep consistency across the GPU-enabled nodes, we are considering replacing CentOS8 with Ubuntu 18.04 on the EXXACT servers.
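To illustrate the kind of separate installs involved, the sketch below shows a CUDA program being built twice, once on an A100 node (Ubuntu 18.04, compute capability 8.0) and once on an RTX8000/RTX6000 node (CentOS8, compute capability 7.5), with each build kept as its own binary.  This is only a rough sketch, not a finalized layout; the source file name and output names are hypothetical.

    # Hypothetical example: build the same CUDA source once per GPU family
    # On an A100 (Ubuntu 18.04, compute capability 8.0) node:
    nvcc -O2 -gencode arch=compute_80,code=sm_80 mytool.cu -o mytool_a100

    # On an RTX8000/RTX6000 (CentOS8, compute capability 7.5) node:
    nvcc -O2 -gencode arch=compute_75,code=sm_75 mytool.cu -o mytool_rtx

Differences in the OS libraries (Ubuntu vs CentOS) matter as much as the GPU architecture flags here, which is why simply copying a binary between the two node types often will not work.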

POLICIES STILL BEING DISCUSSED

Lots of policy decisions still need to be made by the steering committee:

  • How to apportion and charge for the VAST storage.
  • How to control access to submitting jobs on the cluster.  Do we charge for access?  Or charge for time/resources used?  By user or by project/lab?
  • How do we fairly schedule jobs submitted by users?
  • How do we handle “deadline emergencies” from PIs who need to hoard all resources for a short, immediate timeframe?
  • What maximum time limit per job do we allow?
  • Do we apportion some nodes to be dedicated to a “quick” queue with small maximum time limits?

APPENDIX

How SLURM changes things

We plan to make several changes compared to our old Launchpad cluster for this new MLSC cluster.  Most significant is that we will be using the SLURM batch submission engine instead of TORQUE/PBS.  Besides having a very different command set for submitting, monitoring, and modifying jobs (e.g. see the Harvard FAS SLURM Help page), there will be a significant change in how job queue priority works.
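For day-to-day use, the most visible difference is the commands themselves.  Below is a rough cheat sheet of common TORQUE/PBS commands and their closest SLURM equivalents; the job ID and script names are just placeholders.

    qsub myjob.pbs       ->  sbatch myjob.sh      (submit a batch job)
    qstat                ->  squeue               (list queued and running jobs)
    qstat -u $USER       ->  squeue -u $USER      (list only your jobs)
    qdel 12345           ->  scancel 12345        (cancel a job)
    qsub -I              ->  srun --pty bash      (start an interactive job)
    pbsnodes -a          ->  sinfo -N             (show node status)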

First let me recap how job queue priority works on Launchpad.  On Launchpad the batch manager has no concept of groups, projects, or accounts, only users.  Each user on Launchpad is equal, with very few exceptions.  Each submission queue has a base priority with which a job enters the idle queue of the scheduler.  In general, queues with lower limits on maximum running jobs or run time get a higher base starting priority.

Most importantly, each user is allowed only 8 CPU slots in the scheduler idle queue at any one time.  That could be one 8-CPU-slot job or eight 1-CPU-slot jobs, for example.  Until a job enters the scheduler idle queue, it cannot be considered for running on a node.  It also cannot accrue priority beyond its submission queue’s base priority.  Jobs in the idle queue, if they are not run immediately because resources are not available, will sit there and accrue priority over time.

The next job to be run will always be the highest-priority job in the scheduler idle queue.  If it needs 8 CPU slots and thus a whole free node, everything will wait until a node with 8 free slots becomes available before any other job can run, even a 1-CPU job right behind it in priority when there are 50+ nodes with 1 free slot.

So on Launchpad, job priority is purely user-based, and the only things that affect priority are the base priority of the submission queue plus the time a job has spent in the scheduler idle queue.  The batch manager does not take into account that UserA might have used 10,000 CPU hours last week while UserB used zero.  If they both submit 100 new jobs at the same time, the priorities of those jobs will be treated equally.  Also, being purely user-based, groups with more users get the ability to run more jobs, since they can have 8 x #users CPU slots in the scheduler idle queue at once.

With SLURM on the MLSC cluster, one plan we are considering for the priority system is very different.  It will be group-based, with priority determined by something called “fairshare” that modifies each job’s priority according to the CPU/GPU/memory resources the group has used on the cluster in the previous weeks.  Basically, groups that are heavy users of the system will have their job priority decreased to give groups that have recently used the cluster less a better chance to run their jobs.  We plan to model our system very closely on the way Harvard FAS runs their cluster, which is described in a lot of detail on this page.
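To make this concrete, here is a rough sketch of what group-based submission and priority inspection could look like under SLURM.  The account and partition names (“mylab”, “basic”) are placeholders, not decisions that have been made; the commands themselves (sbatch, sshare, sprio) are standard SLURM tools.

    # Submit a job charged against your group's SLURM account (names are hypothetical)
    sbatch --account=mylab --partition=basic --time=4:00:00 --mem=8G myjob.sh

    # Show the fairshare standing of every account and user
    sshare -a

    # Show the priority factors (age, fairshare, etc.) of pending jobs
    sprio -l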

If you have any questions or comments on this change, please discuss on the Martinos Core IT Channel.


Last Updated on September 14, 2020 by Paul Raines