
Introduction

Advanced Research Computing (ARC) provides high-performance computing clusters for researchers at Virginia Tech. I created this article for new ARC users because, when I started using ARC, it was a little tough: I was new to many things such as Slurm scheduling, ARC modules, etc. The information on the ARC documentation pages is scattered, and it is sometimes hard to find exactly what you're looking for. So I have written this article for everyone who wants to train their first deep learning model on ARC with PyTorch.
GPU clusters at VT (More info here):
  • Cascades: older cluster with V100 GPUs; many of them have now been moved to Infer (More info here)
  • Huckleberry: older cluster with NVIDIA Tesla P100 GPUs (More info here)
  • Infer: NVIDIA Volta V100 GPUs, NVIDIA Tesla P100 GPUs, NVIDIA Tesla T4 GPUs (I use this for small to medium models) (More info here)
  • TinkerCliffs: NVIDIA A100 GPUs (I use this for large models; for small models needing < 16 GB of GPU memory I would recommend the Infer V100 GPUs, since a100_normal_q is usually busy and the wait times are longer.) (More info here)
Logging in

You can log in to any of the servers using ssh.
  • ssh <username>@tinkercliffs1.arc.vt.edu
  • ssh <username>@tinkercliffs2.arc.vt.edu
  • ssh <username>@infer1.arc.vt.edu
  • ssh <username>@infer2.arc.vt.edu
Storage

You can read about the storage options you have here. You should have been given access to some storage in /projects, which is accessible from both the Infer and TinkerCliffs clusters. I generally keep my code and other small files in my /home directory, while the large datasets are stored in /projects.
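For example, my layout looks roughly like the sketch below (the repository URL and the project directory name are placeholders; use your own allocation's directory):

    cd $HOME
    git clone https://github.com/<username>/<my_project>.git      # small code repo lives in /home
    mkdir -p /projects/<my_allocation>/datasets                    # large datasets live in /projects
    ln -s /projects/<my_allocation>/datasets ~/<my_project>/data   # optional: symlink the data into the repo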

Job Submission

ARC uses Slurm for scheduling, so you need to create a job file as described on the Cascades page here. You can also check out this example job file I have created to run Python code on a GPU.
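In case that link is not handy, a minimal GPU job file looks roughly like the sketch below (the allocation name, partition, conda environment, and script name are placeholders you should change to match your setup):

    #!/bin/bash
    #SBATCH --account=<allocation_name>    # your allocation (see coldfront.arc.vt.edu)
    #SBATCH --partition=v100_normal_q      # GPU queue, see the table in the next section
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --gres=gpu:1                   # request 1 GPU
    #SBATCH --mem=50G
    #SBATCH --time=12:00:00                # hh:mm:ss, keep within the queue's time limit

    module reset
    module load Anaconda/2020.11           # see the Modules section below
    source activate <env_name>             # conda environment with torch installed

    python train.py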

Slurm queueing

There are different queues on which we can submit our jobs depending upon the job requirements. Here are the ones I use frequently:

  • v100_normal_q: V100 GPUs (Infer, Cascades)
  • a100_normal_q: A100 GPUs (TinkerCliffs)
  • normal_q: CPU-only queue (all clusters)
Other limits such as maximum time, maximum cores, maximum GPUs, etc. are listed on the page of each cluster on which the queue exists. For most of the queues there is a corresponding dev_q, e.g. v100_dev_q, which lets you access the queue for a short amount of time and generally has little (often no) waiting time. There are more GPU queues on Infer as well.
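If you just want a quick look at a queue's time limit and how busy its nodes are from the command line, sinfo can help (this is standard Slurm, nothing ARC-specific):

    sinfo -p v100_dev_q        # time limit and node states (idle/alloc) for that queue
    sinfo -p a100_normal_q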

Slurm queuing details are given on this page.

Here are some of the important commands related to Slurm queueing:
  • Submit job: sbatch ./file_name.sh
  • Check job status: squeue -j <job_number> (You'll get the job number after submitting the job.)
  • Check queue status: squeue -p <queue_name> (an example <queue_name> would be v100_normal_q on Infer)
  • Check jobs submitted by particular user on current cluster: squeue -u <username>
The output of a job is saved in a file named slurm-<job_number>.out in the same location from which you submitted the job.
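Putting those together, a typical submit-and-check cycle looks like this (the job number here is made up for illustration):

    sbatch ./job.sh            # Slurm replies with something like: Submitted batch job 123456
    squeue -j 123456           # check that one job
    squeue -u <username>       # or check all of your jobs on the cluster
    cat slurm-123456.out       # output appears next to where you ran sbatch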

Interactive session

An interactive session is mostly used for development or debugging. You can get an interactive session on a GPU by queuing to a dev_q for up to 2-4 hours, or to a normal_q for more time.
Commands for an interactive session:
  • interact -A <allocation_name> -t 2:00:00 -p v100_dev_q -n 1 --gres=gpu:1 --mem=50G
    Here I am requesting an interactive session for 2 hours on v100_dev_q with 1 GPU and 50 GB of memory.
  • interact -A virtual_presenter -t 10:00:00 -p normal_q --ntasks=200 --mem-per-cpu=10G
    Here I am requesting an interactive session for 10 hours on normal_q (a CPU queue) with 200 cores and 10 GB of memory per CPU core.
To check your allocation name(s), log in to coldfront.arc.vt.edu.
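Once the interactive GPU session starts, I usually sanity-check that the GPU is actually visible before doing anything else (this assumes torch is already installed in the active environment):

    nvidia-smi                                                     # should list the allocated GPU
    python -c "import torch; print(torch.cuda.is_available())"    # should print True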

Modules

In simple words, modules are applications/libraries that you can load in order to use them.
Commands related to modules (The available modules may vary from server to server, but here's what I use on TinkerCliffs):
  • See all available modules: module avail
  • Unload all modules: module purge
  • Reset to the default modules: module reset
  • Load Anaconda: module load Anaconda/2020.11 (I use this one; I believe it is the latest version available there)
  • Load FFmpeg: module load FFmpeg
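A typical sequence I run before setting up an environment looks like this:

    module reset                   # start from the default modules
    module load Anaconda/2020.11   # brings conda onto the PATH
    module list                    # confirm what is currently loaded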
Process for running PyTorch code:

  1. Log in to a cluster.
  2. Load Anaconda Module.
  3. Create a conda environment (you probably already know how to do this).
  4. Start an interactive session on a GPU-based dev_q so that the appropriate torch version gets installed.
  5. Load the CUDA module if required (it is already loaded on TinkerCliffs; I am not sure about Infer).
  6. Install torch and your other requirements in the environment.
  7. Create a job file as given above (change it as per your requirements and load the proper modules; there is no need to load CUDA if it was already loaded when you tried things in the interactive session).
  8. Submit the job file (a rough sketch of these steps as shell commands follows this list).
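Here is that sketch of steps 2-8 (the environment name, Python version, and package list are placeholders; the interact line is the same one from the interactive-session section):

    module load Anaconda/2020.11
    conda create -n <env_name> python=3.9 -y
    interact -A <allocation_name> -t 2:00:00 -p v100_dev_q -n 1 --gres=gpu:1 --mem=50G
    # inside the interactive GPU session:
    module load Anaconda/2020.11
    source activate <env_name>
    pip install torch torchvision         # plus whatever else your project needs
    exit                                  # back on the login node
    sbatch ./job.sh                       # submit the job file from the Job Submission section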
I hope this article helps you get started on ARC faster. If you have any doubts regarding ARC, you can join the ARC office hours; the staff there are really helpful and have helped me resolve my queries on multiple occasions. If you have any doubts regarding this article, you can email me at juvekaradheesh at vt dot edu.