Getting Started
Please note: this service is intended primarily to support coursework and individual projects on taught programmes in the Department of Computing. Researchers and members of other departments should consult the Research Computing Services (RCS) for college-provided compute resources.
Update 19/12/24

Please note: as of Thursday 19 December 2024, offsite/external password-based SSH authentication is not possible on the head node servers. Logging in from a DoC Lab PC works as before. Please see the Shell Server guide for how to set up public/private key authentication (you must be onsite at Imperial in order to create SSH keys for external remote access).
Introduction
What is Slurm and the GPU Cluster?
Slurm is an open-source job scheduling system for Linux that manages compute resources; in this case, the department's GPU resources.
Using Slurm commands such as 'sbatch' and 'salloc', your scripts are executed on our pool of NVIDIA GPU Linux servers. Typical workloads include CUDA-based parallel computing for deep learning, machine learning and large language models (LLMs), using frameworks such as PyTorch, TensorFlow or JAX, among others.
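A batch job is typically described by a short shell script containing '#SBATCH' directives and submitted with 'sbatch'. The following is a minimal sketch only; the GPU request syntax and time limit shown are assumptions and may need adjusting for the DoC cluster (see the example submission script later in this guide).

```bash
#!/bin/bash
# Minimal batch job sketch (illustrative; adjust resource options as needed)
#SBATCH --job-name=test-job   # name shown in the queue
#SBATCH --gres=gpu:1          # request one GPU
#SBATCH --time=04:00:00       # wall-clock limit of 4 hours

nvidia-smi                    # confirm which GPU was allocated
python3 my_script.py          # replace with your own script
```

Submitting it is then a single command, for example 'sbatch my_job.sh'; 'salloc' instead requests an interactive allocation, as covered in the Quick Start steps below.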
Read this guide to learn how to:
- connect to the submission host server and submit a test script
- start an interactive job (connect directly to a GPU, reserved exclusively for you for a time limit; see the example after this list)
- compose a shell script that uses shared storage, a Python environment, CUDA and your Python scripts
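As an illustration of the interactive route mentioned above, a request along these lines reserves a GPU for a fixed period (a sketch only; the exact GRES value and any partition options are assumptions, and the 'salloc' Quick Start step below is authoritative):

```bash
# Request one GPU for two hours interactively (illustrative options)
salloc --gres=gpu:1 --time=02:00:00

# Once the allocation is granted, start a shell on the allocated node
srun --pty bash
```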
Before you start
Some familiarity with Department of Computing systems is desirable before using the GPU cluster:
Tip: make sure you have tested your Python scripts on your own device or a DoC Lab PC with a GPU before using the GPU cluster. Prior testing will help flag errors with your scripts before using sbatch.

Follow Nuri's guide for an introduction to using Linux in the Department of Computing.
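Following the tip above, a quick way to confirm that your environment can actually see a GPU before you queue anything (assuming you use PyTorch; other frameworks have equivalent checks) is:

```bash
# Prints True on a machine with a working GPU and a CUDA-enabled PyTorch install
python3 -c "import torch; print(torch.cuda.is_available())"
```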
Step by step
- 1a. Quick Start (submit from a DoC Lab PC)
- 1b. Quick Start (externally from a personal device)
- 1c. Quick Start (interactive shell using 'salloc')
- 2. Store your datasets under /vol/bitbucket
- 3. Creation of a Python virtual environment for your project (example)
- 4. Using CUDA (add to a script)
- 5. Example submission script (a minimal sketch also appears after this list)
- 6. Connect to a submission host to send jobs
- 6b. GPU types
- Frequently Asked Questions
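Steps 2 to 5 above build towards a submission script along the following lines. This is a hedged sketch only: the virtual environment location, the CUDA setup line and the script and dataset names are placeholders, and the guide's own example in step 5 is authoritative.

```bash
#!/bin/bash
# Illustrative submission script combining the pieces from steps 2-5
#SBATCH --gres=gpu:1                              # request one GPU
#SBATCH --mail-type=ALL                           # email on start/end/failure
#SBATCH --mail-user=your_username@imperial.ac.uk  # placeholder address

# Activate a Python virtual environment kept under /vol/bitbucket (step 3)
source /vol/bitbucket/${USER}/myenv/bin/activate

# Make CUDA visible to the job (step 4); the exact path is an assumption
# source /vol/cuda/<version>/setup.sh

# Run your training script, reading data stored under /vol/bitbucket (step 2)
python3 train.py --data /vol/bitbucket/${USER}/datasets
```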
General Comments
The following policies are in effect on the GPU Cluster:
- Taught students can have at most two running jobs; any further jobs are queued until one of the two running jobs completes.
- A job that runs for more than four days will be automatically terminated (a walltime restriction for taught students). Configure checkpoints in your Python framework so that training can resume; see the sketch after this list.
- As with all departmental resources, any non-academic use of the GPU cluster is strictly prohibited.
- Any users who violate this policy will be banned from further usage of the cluster and will be reported to the appropriate departmental and college authorities.
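To work within the four-day walltime mentioned above, a job script can check for an existing checkpoint and resume from it when the job is resubmitted. The sketch below assumes your training script saves checkpoints and accepts a '--resume' flag; the flag name and checkpoint path are placeholders.

```bash
# Resume training from the latest checkpoint if one exists (illustrative paths)
CKPT=/vol/bitbucket/${USER}/checkpoints/latest.pt
if [ -f "$CKPT" ]; then
    python3 train.py --resume "$CKPT"
else
    python3 train.py
fi
```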