Introduction
Advanced Research Computing (ARC) provides high-performance computing clusters for researchers at Virginia Tech. I created this article for new ARC users because when I started using ARC it was a little tough, as I was new to many things such as Slurm scheduling, ARC modules, etc. The information on the ARC documentation pages is scattered, and it is sometimes hard to find exactly what you're looking for. So, I have written this article for everyone who wants to train their first deep learning model on ARC with PyTorch.
GPU clusters at VT (More info here):
- Cascades: old server, V100 GPUs; many have now been moved to Infer (More info here)
- Huckleberry: old server, NVIDIA Tesla P100 GPUs (More info here)
- Infer: NVIDIA Volta V100, NVIDIA Tesla P100, and NVIDIA Tesla T4 GPUs (I use this for small-to-normal models) (More info here)
- TinkerCliffs: NVIDIA A100 GPUs (I use this for large models; I would recommend using Infer's V100 GPUs for small models requiring < 16GB of GPU memory, since `a100_normal_q` is usually busy and wait times are longer.) (More info here)
Logging in
You can log in to any of the servers using ssh.
```bash
ssh <username>@tinkercliffs1.arc.vt.edu
ssh <username>@tinkercliffs2.arc.vt.edu
ssh <username>@infer1.arc.vt.edu
ssh <username>@infer2.arc.vt.edu
```
Storage
You can read about the storage options you have here. You should have been given access to some storage in `/projects`, which is accessible from both the Infer & TinkerCliffs clusters. I generally store my code and some small stuff in my `/home` directory, and the large datasets are stored in `/projects`.
Job Submission
ARC uses Slurm for scheduling, so you need to create a job file as described on the Cascades page here. You can also check out this example job file I have created to run a Python script on a GPU; a minimal sketch of such a file is shown below.
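In case the linked file is not handy, here is a minimal sketch of what such a job file can look like. The allocation name, environment name, and script name are placeholders, and the module version follows what I use on TinkerCliffs:

```bash
#!/bin/bash
# Placeholders: replace <allocation_name> with your ARC allocation.
# The partition here is v100_normal_q; pick a queue from the table in the next section.
#SBATCH --account=<allocation_name>
#SBATCH --partition=v100_normal_q
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --mem=50G
#SBATCH --time=12:00:00

# Load the modules the code needs (names/versions may differ per cluster)
module reset
module load Anaconda/2020.11

# Activate the conda environment that has PyTorch installed
source activate my_env

# Run the training script (train.py is a placeholder for your own script)
python train.py
```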
Slurm queueing
There are different queues to which we can submit our jobs, depending upon the job's requirements. Here are the ones I use frequently:
Queue | Resources | Cluster |
---|---|---|
v100_normal_q | V100 GPUs | Infer, Cascades |
a100_normal_q | A100 GPUs | TinkerCliffs |
normal_q | CPU queues | All clusters |
There is also a corresponding `dev_q` for each GPU queue (e.g., `v100_dev_q`), which lets you access the queue for a short amount of time and generally has little (mostly no) waiting time. There are more GPU queues on Infer as well.
Slurm queuing details are given on this page. Here are some of the important commands related to Slurm queuing:
- Submit a job: `sbatch ./file_name.sh`
- Check job status: `squeue -j <job_number>` (you'll get the job number after submitting the job)
- Check queue status: `squeue -p <queue_name>` (an example `<queue_name>` would be `v100_normal_q` on Infer)
- Check jobs submitted by a particular user on the current cluster: `squeue -u <username>`
The job's output is written to `Slurm-<job_number>.out` at the same location from where you submitted the job.
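For example, a typical submit-and-check cycle looks like this (the job number 123456 is hypothetical; use the one sbatch prints for you):

```bash
sbatch ./train_job.sh     # prints something like: Submitted batch job 123456
squeue -j 123456          # check the status of that job
tail -f Slurm-123456.out  # follow the job's output while it runs
```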
Interactive session
An interactive session is mostly used for development or debugging. You can get an interactive session on a GPU by queuing to a `dev_q` for up to 2-4 hours, or to a `normal_q` for more time.
Commands for an interactive session:
- `interact -A <allocation_name> -t 2:00:00 -p v100_dev_q -n 1 --gres=gpu:1 --mem=50G` (here I am requesting an interactive session for 2 hours on `v100_dev_q` with 1 GPU & 50GB of memory)
- `interact -A virtual_presenter -t 10:00:00 -p normal_q --ntasks=200 --mem-per-cpu=10G` (here I am requesting an interactive session for 10 hours on `normal_q`, a CPU queue, with 200 cores and 10GB of memory per CPU core)
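Once the session starts, you can quickly confirm that a GPU is actually visible; the second command assumes a conda environment with PyTorch installed is already activated:

```bash
nvidia-smi  # should list the GPU you were allocated
python -c "import torch; print(torch.cuda.is_available())"  # should print True
```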
Modules
To put it simply, modules are basically applications/services which you can load in order to use them.
Commands related to modules (The available modules may vary from server to server, but here's what I use on TinkerCliffs):
- See all available modules: `module avail`
- Unload all modules: `module purge`
- Reset to the default modules: `module reset`
- Load Anaconda: `module load Anaconda/2020.11` (I use this one; I believe it's the latest available there)
- Load FFmpeg: `module load FFmpeg`
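As a concrete example, my typical sequence on TinkerCliffs before doing anything else is a sketch like the following (module names and versions may differ on other clusters):

```bash
module reset                  # go back to the default module set
module load Anaconda/2020.11  # conda, for managing Python environments
module list                   # verify what is currently loaded
```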
Process for running PyTorch code:
- Log in to a cluster.
- Load the Anaconda module.
- Create a conda environment (you must know this).
- Start an interactive session in a GPU-based `dev_q` so that the appropriate torch version gets installed.
- Load the CUDA module if required (it is already loaded on TinkerCliffs; I am not sure about Infer).
- Install torch & the other requirements in the environment.
- Create a job file as given above (change it as per your requirements and load the proper modules; there is no need to load CUDA if it was already loaded when you tried it in the interactive session).
- Submit the job file.
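Putting the steps together, a first-time setup might look roughly like this. The environment name and allocation name are placeholders, `a100_dev_q` is my assumption following the same naming pattern as `v100_dev_q`, and you should check the PyTorch website for the install command matching the cluster's CUDA version:

```bash
# 1-2. Log in and load Anaconda
ssh <username>@tinkercliffs1.arc.vt.edu
module load Anaconda/2020.11

# 3. Create and activate a conda environment (my_env is a placeholder)
conda create -n my_env python=3.8 -y
source activate my_env

# 4. Grab a GPU node so the CUDA-enabled torch build gets picked up
interact -A <allocation_name> -t 2:00:00 -p a100_dev_q -n 1 --gres=gpu:1

# 5. Install torch and the other requirements inside the environment
pip install torch torchvision

# 6-7. Create a job file like the example above and submit it
sbatch ./train_job.sh
```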
juvekaradheesh at vt dot edu