Getting Started
Please note: this service is intended primarily to support coursework and individual projects on taught programmes in the Department of Computing. Researchers and members of other departments should consult the Research Computing Services (RCS) for college-provided compute resources.
Update 19/12/24

Please note: as of Thursday 19 December 2024, offsite/external password-based SSH authentication is not possible on the head node servers. Logging in from a DoC Lab PC works as before. Please see the Shell Server guide for how to set up public/private key authentication (you must be onsite at Imperial in order to create SSH keys for external remote access).
Introduction
What is Slurm and the GPU Cluster?
Slurm is an open-source job scheduling system for Linux that manages compute resources; in this case, the department's GPU resources.
Using Slurm commands such as 'sbatch' and 'salloc', your scripts are executed on our pool of NVIDIA GPU Linux servers. Typical workloads include CUDA-based parallel computing for deep learning, machine learning and large language models (LLMs), using frameworks such as PyTorch, TensorFlow or JAX, among others.
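A batch job is typically described by a short shell script containing '#SBATCH' directives and submitted with 'sbatch'. The following is a minimal sketch only; the GPU request syntax and time limit shown are assumptions and may need adjusting for the DoC cluster (see the example submission script later in this guide).

```bash
#!/bin/bash
# Minimal batch job sketch (illustrative; adjust resource options as needed)
#SBATCH --job-name=test-job   # name shown in the queue
#SBATCH --gres=gpu:1          # request one GPU
#SBATCH --time=04:00:00       # wall-clock limit of 4 hours

nvidia-smi                    # confirm which GPU was allocated
python3 my_script.py          # replace with your own script
```

Submitting it is then a single command, for example 'sbatch my_job.sh'; 'salloc' instead requests an interactive allocation, as covered in the Quick Start steps below.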
Read this guide to learn how to:
- connect to the submission host server and submit a test script
- start an interactive job (connect directly to a GPU, reserved exclusively for you for a time limit; see the example after this list)
- compose a shell script that uses shared storage, a Python environment, CUDA and your Python scripts
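As an illustration of the interactive route mentioned above, a request along these lines reserves a GPU for a fixed period (a sketch only; the exact GRES value and any partition options are assumptions, and the 'salloc' Quick Start step below is authoritative):

```bash
# Request one GPU for two hours interactively (illustrative options)
salloc --gres=gpu:1 --time=02:00:00

# Once the allocation is granted, start a shell on the allocated node
srun --pty bash
```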
Before you start
Some familiarity with Department of Computing systems is desirable before using the GPU cluster:
Tip: make sure you have tested your Python scripts on your own device or a DoC Lab PC with a GPU before using the GPU cluster. Prior testing will help flag errors with your scripts before using sbatch.

Follow Nuri's guide for an introduction to using Linux in the Department of Computing.
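Following the tip above, a quick way to confirm that your environment can actually see a GPU before you queue anything (assuming you use PyTorch; other frameworks have equivalent checks) is:

```bash
# Prints True on a machine with a working GPU and a CUDA-enabled PyTorch install
python3 -c "import torch; print(torch.cuda.is_available())"
```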
Step by step
- 1a. Quick Start (submit from a DoC Lab PC)
- 1b. Quick Start (externally from a personal device)
- 1c. Quick Start (interactive shell using 'salloc')
- 2. Store your datasets under /vol/bitbucket
- 3. Creation of a Python virtual environment for your project (example)
- 4. Using CUDA (add to a script)
- 5. Example submission script (a minimal sketch also appears after this list)
- 6. Connect to a submission host to send jobs
- 6b. GPU types
- Frequently Asked Questions
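Steps 2 to 5 above build towards a submission script along the following lines. This is a hedged sketch only: the virtual environment location, the CUDA setup line and the script and dataset names are placeholders, and the guide's own example in step 5 is authoritative.

```bash
#!/bin/bash
# Illustrative submission script combining the pieces from steps 2-5
#SBATCH --gres=gpu:1                              # request one GPU
#SBATCH --mail-type=ALL                           # email on start/end/failure
#SBATCH --mail-user=your_username@imperial.ac.uk  # placeholder address

# Activate a Python virtual environment kept under /vol/bitbucket (step 3)
source /vol/bitbucket/${USER}/myenv/bin/activate

# Make CUDA visible to the job (step 4); the exact path is an assumption
# source /vol/cuda/<version>/setup.sh

# Run your training script, reading data stored under /vol/bitbucket (step 2)
python3 train.py --data /vol/bitbucket/${USER}/datasets
```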
General Comments
The following policies are in effect on the GPU Cluster:
- Taught students can have at most two running jobs; any further jobs are queued until one of the two running jobs completes.
- A job that runs for more than four days will be automatically terminated (a walltime restriction for taught students). Configure checkpoints in your Python framework so that training can resume; see the sketch after this list.
- As with all departmental resources, any non-academic use of the GPU cluster is strictly prohibited.
- Any users who violate this policy will be banned from further usage of the cluster and will be reported to the appropriate departmental and college authorities.
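To work within the four-day walltime mentioned above, a job script can check for an existing checkpoint and resume from it when the job is resubmitted. The sketch below assumes your training script saves checkpoints and accepts a '--resume' flag; the flag name and checkpoint path are placeholders.

```bash
# Resume training from the latest checkpoint if one exists (illustrative paths)
CKPT=/vol/bitbucket/${USER}/checkpoints/latest.pt
if [ -f "$CKPT" ]; then
    python3 train.py --resume "$CKPT"
else
    python3 train.py
fi
```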