We are finding that the NVIDIA-optimized TensorFlow Docker images from NGC outperform the tensorflow-gpu packages one typically installs with Conda. These Docker images can be run under Singularity on the MLSC cluster GPU partitions, or even on your own workstation if it has an NVIDIA GPU.
ALSO READ: Docker/Singularity at the Martinos Center
GETTING NGC IMAGES
You can get the TensorFlow images, as well as PyTorch images and others, from the NGC container website. The direct link to TensorFlow's NGC page is here. Be aware that some images are behind a forced registration wall. For more information on using NVIDIA NGC, see the NGC Manual site.
The Martinos IT group has downloaded some versions of the TensorFlow images to /cluster/batch/IMAGES, so you can use them without having to pull them yourself.
A good overview of using Singularity to pull and run NGC Docker images can be found in this guide.
One key issue is that the pull will fill up your home directory quota unless you set the SINGULARITY_TMPDIR and SINGULARITY_CACHEDIR environment variables to point to storage with plenty of free space (>30GB). For example:
$ mkdir -p /scratch/raines/singularity/{tmp,cache}
$ export SINGULARITY_TMPDIR=/scratch/raines/singularity/tmp
$ export SINGULARITY_CACHEDIR=/scratch/raines/singularity/cache
$ cd /cluster/batch/IMAGES
$ singularity pull tensorflow-19.11-tf1-py3.sif \
    docker://nvcr.io/nvidia/tensorflow:19.11-tf1-py3
The SINGULARITY_TMPDIR and SINGULARITY_CACHEDIR environment variables are automatically set to appropriate scratch directories when you are inside an MLSC job.
To pull images that require NGC account registration, see this page.
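As a rough sketch of what that involves (the details are on that page), Singularity can pass registry credentials through its standard SINGULARITY_DOCKER_USERNAME and SINGULARITY_DOCKER_PASSWORD environment variables; for NGC the username is the literal string $oauthtoken and the password is your NGC API key. Treat the exact values below as assumptions and check the linked page:

$ export SINGULARITY_DOCKER_USERNAME='$oauthtoken'
$ export SINGULARITY_DOCKER_PASSWORD=<your NGC API key>
$ singularity pull tensorflow-19.11-tf1-py3.sif \
    docker://nvcr.io/nvidia/tensorflow:19.11-tf1-py3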
RUNNING IMAGES WITH SINGULARITY
An overview of using Singularity can be found at the Singularity Quick Start page.
Users should test running with Singularity on one of their group workstations or on the interactive server icepuff1.
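As a quick sanity check (using the TensorFlow 2.x image from the example below), something like the following should show the GPU both at the driver level and from TensorFlow; the exact output will depend on your hardware:

$ singularity run --nv /cluster/batch/IMAGES/tensorflow-20.12-tf2-py3.sif nvidia-smi
$ singularity run --nv /cluster/batch/IMAGES/tensorflow-20.12-tf2-py3.sif \
    python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"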
In the following example we will use a TensorFlow 2.x image at /cluster/batch/IMAGES/tensorflow-20.12-tf2-py3.sif to run the tf_cnn_benchmarks script.
$ cd /cluster/batch/raines
$ export CUDA_VISIBLE_DEVICES=0,1,2,3
$ singularity run --nv -B $PWD:/mnt -B /scratch -B /autofs \
    -B /space -B /cluster -B /vast -B /usr/pubsw \
    /cluster/batch/IMAGES/tensorflow-20.12-tf2-py3.sif python \
    /mnt/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --num_gpus=4 --batch_size=32 --model=vgg19 --data_format=NHWC
For NGC images, you need to use the run subcommand instead of the exec subcommand so that the RUNSCRIPT in the image is called first to set up various things needed for using the image. The --nv option enables Singularity's support for NVIDIA GPUs.
The -B option is used to mount directories from outside the container to inside the container. Remember that the image is a totally separate Linux virtual system (all the NGC images seem to be Ubuntu based). These bind mounts are the only way to really bridge between the image and the normal Martinos environment. You might find using “-B /autofs -B /cluster -B /space -B /vast -B /usr/pubsw” useful to make the file environment inside the container closely match the normal Martinos one. You should always include “-B /scratch” if it exists on the machine you are running on (it does on all MLSC cluster machines) or you may get errors about /scratch not existing.
Also be aware that the current working directory when you start a container defaults to your home directory, so use absolute paths to your program and in the program’s arguments.
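If absolute paths are inconvenient, Singularity's --pwd option can set the starting directory inside the container. A minimal sketch, where myscript.py is just a placeholder for your own script sitting in the directory you bind to /mnt:

$ singularity run --nv --pwd /mnt -B $PWD:/mnt \
    /cluster/batch/IMAGES/tensorflow-20.12-tf2-py3.sif python3 ./myscript.py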
One big issue with these images arises when you need Python modules that are not included in the Python install embedded in the image. It is very difficult to modify the image itself to add customizations such as additional Python modules. Instead, it is easier to install the extra modules into a directory under the “/mnt” bind mount and point PYTHONPATH at it. Just make sure it does not override the TensorFlow install in the image, so DO NOT simply add your Anaconda install path to PYTHONPATH.
WARNING: if you have activated an Anaconda environment in your shell, do not run Singularity from that shell, as the environment settings Anaconda makes can interfere with the Python running in the image. Open a new shell where Anaconda has not been sourced.
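A quick way to check whether your current shell is clean is to look for Anaconda's environment variables before you run Singularity; for example:

$ echo $CONDA_PREFIX     # should print nothing in a clean shell
$ env | grep -i conda    # ideally produces no output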
Yes, this can be really complicated. Here is an example of how we install an extra Python module that is needed and then run the analysis that uses it.
$ cd /cluster/itgroup/raines
$ mkdir -p local/lib
$ vi vars.txt   # create it with your favorite editor (emacs, pico)
$ cat vars.txt
PYTHONUSERBASE=/mnt/local
PYTHONPATH=$PYTHONUSERBASE/lib/python3.7/site-packages
PATH=$PYTHONUSERBASE/bin:$PATH
$ singularity run --nv --env-file vars.txt -B $PWD:/mnt -B /scratch:/scratch \
    -B /autofs -B /cluster -B /space -B /vast \
    /cluster/batch/IMAGES/tensorflow-20.12-tf2-py3.sif \
    pip3 install nibabel
$ singularity run --nv --env-file vars.txt -B $PWD:/mnt -B /scratch:/scratch \
    -B /autofs -B /cluster -B /space -B /vast \
    /cluster/batch/IMAGES/tensorflow-20.12-tf2-py3.sif \
    python3 script_needing_nibabel_and_TF.py
We use the --env-file option to point at a file of environment variables to set/override when the shell environment is copied into the container environment. You obviously only need to create vars.txt and do the pip3 install once, so in the future you just have to run:
$ cd /cluster/itgroup/raines
$ singularity run --nv --env-file vars.txt -B $PWD:/mnt -B /scratch:/scratch \
    -B /autofs -B /cluster -B /space -B /vast \
    /cluster/batch/IMAGES/tensorflow-20.12-tf2-py3.sif \
    python3 script_needing_nibabel_and_TF.py
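If you want to double-check that your PYTHONUSERBASE setup has not shadowed the TensorFlow build shipped in the image, one quick check is to print where Python imports it from; the path should point inside the container (typically under /usr/local/lib in the NGC images), not into your /mnt/local directory:

$ singularity run --nv --env-file vars.txt -B $PWD:/mnt -B /scratch:/scratch \
    /cluster/batch/IMAGES/tensorflow-20.12-tf2-py3.sif \
    python3 -c "import tensorflow as tf; print(tf.__version__, tf.__file__)"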
SUBMITTING IMAGES TO RUN ON MLSC CLUSTER
Make sure to test running your container first on your own Linux workstation with a GPU or in an MLSC interactive job. Here is an example of submitting an image to run on the MLSC cluster in batch mode:
$ cd /cluster/itgroup/raines
$ jobsubmit -p rtx6000 -m 64G -t 03:00:00 -A sysadm -c 4 --gpus 2 -- \
    singularity run --nv --env-file vars.txt -B $PWD:/mnt -B /scratch:/scratch \
    -B /autofs -B /cluster -B /space -B /vast \
    /cluster/batch/IMAGES/tensorflow-20.12-tf2-py3.sif python \
    /mnt/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --model=resnet50 --data_format=NCHW --batch_size=256 \
    --num_gpus=2 --num_epochs=90 --optimizer=momentum \
    --variable_update=replicated --weight_decay=1e-4
Note that if the script you are running requires a GPU count argument, as in the example above, it must match the GPU request given to jobsubmit.
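One way to keep those two numbers from drifting apart is to wrap the singularity command in a small shell script and derive the GPU count from Slurm inside the job. This is only a sketch, and it assumes your Slurm version exports SLURM_GPUS_ON_NODE when GPUs are requested with --gpus:

$ cat run_benchmark.sh
#!/bin/bash
# run_benchmark.sh is a hypothetical wrapper: derive the GPU count from Slurm
# inside the job so --num_gpus always matches the --gpus request to jobsubmit.
cd /cluster/itgroup/raines
NGPUS=${SLURM_GPUS_ON_NODE:-1}
singularity run --nv --env-file vars.txt -B $PWD:/mnt -B /scratch:/scratch \
    -B /autofs -B /cluster -B /space -B /vast \
    /cluster/batch/IMAGES/tensorflow-20.12-tf2-py3.sif python \
    /mnt/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --model=resnet50 --data_format=NCHW --batch_size=256 \
    --num_gpus=$NGPUS --num_epochs=90
$ jobsubmit -p rtx6000 -m 64G -t 03:00:00 -A sysadm -c 4 --gpus 2 -- \
    bash /cluster/itgroup/raines/run_benchmark.sh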
Last Updated on November 27, 2023 by Paul Raines