Session 02
Date: Wednesday, Jan 21
Entry ticket: T01 and T02
Learning Objectives
- Differentiate between Login Nodes and Compute Nodes to prevent system overload.
- Distinguish between Home directories (/ihome) and Project storage (/ix) based on their distinct backup policies and quotas.
- Manage software versions and dependencies using Lmod commands rather than installing software locally.
- Configure SSH keys to enable secure, password-less authentication to the cluster.
- Execute computational tasks by submitting SLURM requests for both Interactive sessions and Batch scripts.
Concept Check
Imagine you are studying a protein simulation. You write a single Python script that does the following in one go:
- Reads the massive raw text log.
- Filters out any time steps where the protein’s ‘Total Energy’ is positive.
- Calculates the average stability of the remaining steps.
- Plots the result.
You show the plot to your PI. They ask: ‘Wait, are we sure those positive energy states were errors? What if they were actually rare transition states? Show me the graph with those points included.’ Why is your current ‘Single Script’ architecture a disaster for answering this question?
Code Along
Remote Computing
Scientific computing often requires resources far beyond what a personal laptop can provide. Remote computing is the practice of accessing powerful, centralized computers over a network to perform these tasks. Instead of running a simulation on your physical machine, you send the instructions to a remote machine, or “host,” which executes the work and saves the results.
To navigate this environment effectively, we must define the standard architecture of a High-Performance Computing (HPC) cluster.
The Client-Server Model
The relationship between your computer and the cluster is a Client-Server model.
- Client: Your personal computer (laptop or desktop). It initiates the connection.
- Server (Host): The remote computer at the data center. It listens for connections and provides resources.
- Network: The infrastructure connecting them (Internet/VPN).
Architecture of a Cluster
An HPC cluster is not a single supercomputer; it is a collection of hundreds of individual computers (called nodes) connected by high-speed networks. These nodes are specialized into different roles to maintain efficiency and security.
flowchart LR
subgraph Local["Local Machine (Client)"]
Laptop[Your Laptop]
end
subgraph Cluster["HPC Cluster (Server Side)"]
direction TB
Login[<b>Login Node</b><br/>Gateway & Submission]
subgraph Compute["Compute Farm"]
CN1[Compute Node 1]
CN2[Compute Node 2]
CN3[Compute Node 3]
CN4[Compute Node 4]
end
Storage[(<b>Shared Storage</b><br/>Home & Scratch)]
end
Laptop -- "SSH (Port 22)" --> Login
Login -- "sbatch / srun" --> CN1
Login -- "sbatch / srun" --> CN2
Login -- "sbatch / srun" --> CN3
Login -- "sbatch / srun" --> CN4
%% Storage connections
Login === Storage
CN1 === Storage
CN2 === Storage
CN3 === Storage
CN4 === Storage
classDef plain fill:#fff,stroke:#333,stroke-width:1px;
classDef important fill:#a2d2ff,stroke:#333,stroke-width:2px;
class Laptop,CN1,CN2,CN3,CN4 plain;
class Login,Storage important;
Login Nodes
When you connect to the cluster, you land on a Login Node. Think of this as the reception desk of a factory.
This is where you edit files, compile code, organize data, and submit jobs to the scheduler. It is public-facing (usually requires VPN) and accessible via SSH. Never run heavy computations here. Login nodes are shared by hundreds of users simultaneously. If one user maxes out the CPU on the login node, the system becomes sluggish for everyone else.
Compute Nodes
The actual heavy lifting happens on the Compute Nodes.
These nodes are optimized for raw processing power. They execute the simulations and analyses you requested. You generally do not log into these directly. Instead, you ask the scheduler (SLURM) to allocate one for you. When you are assigned a compute node, you often have exclusive or semi-exclusive access to its resources, ensuring your calculations run as fast as possible without interference.
Shared Storage
The CRCD operates several distinct file systems, each optimized for specific purposes such as configuration, active computation, or long-term archiving. Because the compute nodes and login nodes share these file systems, you can access your data from anywhere in the cluster.
flowchart TD
subgraph Root["/ (Root File System)"]
direction TB
subgraph Home["/ihome (Personal Configs)"]
U1[User Home: /ihome/group/user]
end
subgraph Project["/ix & /ix1 (Project Data)"]
P1[Group Project: /ix/group]
P2[Group Project: /ix1/group]
end
end
User((User)) --> U1
User --> P1
User --> P2
style Home fill:#e1f5fe,stroke:#01579b
style Project fill:#e8f5e9,stroke:#2e7d32
File System Summary
| Location | Purpose | Quota | Backups | Snapshots |
|---|---|---|---|---|
| /ihome | Home Directories. Small config files, environments, and scripts. | 75 GB (Fixed) | Daily | Yes |
| /ix & /ix1 | Project Storage. The primary location for active datasets and job staging. | 5 TB (Expandable) | None | 7 Days |
Storage Details
Home Directories (/ihome)
This is your entry point when logging into the system.
It is organized by user group: /ihome/<primary_group>/<username>.
It is best for source code, Conda/Python environments, and Jupyter session logs. Each user has only 75 GB available; this quota is strict and cannot be increased. This is the only location that is backed up daily.
Project Data (/ix and /ix1)
These are enterprise storage locations designed for persistent file storage and heavy workloads.
This is best for staging input data for compute jobs and storing results.
By default, only your PI and group members can access your folder here (/ix/groupname).
Groups receive a 5 TB allocation at no cost.
Additional storage can be purchased ($85/TB/year).
While there are no long-term backups, 7-day snapshots are available to restore accidentally deleted files.
Warning
No Backups on Project Storage
The /ix and /ix1 file systems are not backed up.
If you delete a file and wait longer than 7 days, it is gone forever.
Always keep copies of critical raw data on separate local storage (e.g., your lab’s OneDrive or a physical hard drive).
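For instance, here is a minimal sketch of pulling a results directory from project storage down to a laptop with rsync, run from the local machine (the project and directory names are placeholders):
```bash
# Run this on your LOCAL machine, not on the cluster.
# Copies results from project storage on the cluster into ~/backups on your laptop.
rsync -avz amm503@h2p.crc.pitt.edu:/ix/jdurrant/project1/results/ ~/backups/project1-results/
```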
Managing Your Storage
Checking Quotas (crc-quota)
To check how close you are to your storage limits across all file systems, use the crc-quota wrapper command:
[amm503@login0 ~]$ crc-quota
User: 'amm503'
-> ihome: 41.43 GB / 75.0 GB
Group: 'jdurrant'
-> ix: 23.5 TB / 29.0 TB
File Permissions & Sharing
Linux uses POSIX permissions to control who can Read, Write, or eXecute files.
By default, your files in /ix are visible to your group members but not to outsiders.
- View permissions: `ls -l`
- Change permissions: `chmod` (e.g., `chmod g+rx filename` gives your group read/execute access).
- Default permissions: Determined by `umask`. A umask of `007` allows full group access, while `077` restricts everything to the owner only. (See the short example below.)
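A short illustrative session combining these commands (the file name results.csv is just a placeholder):
```bash
# Inspect current permissions on a file
ls -l results.csv
# -rw------- 1 amm503 jdurrant 1048576 Jan 21 10:00 results.csv

# Grant your group read and execute/search access
chmod g+rx results.csv

# Check and adjust the default permissions for newly created files
umask          # 0077 means new files are private to the owner
umask 0007     # new files created in this shell will allow full group access
```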
If you need to share data with a collaborator in a different group, standard permissions are often insufficient. The CRCD recommends using Access Control Lists (ACLs) for this purpose.
Do not try to configure complex ACLs manually if you are unsure.
Submit a Help Ticket requesting a shared folder.
The admins will create a dedicated directory (e.g., /ix/group/shared/folder) with the correct nfs4_setfacl permissions applied for your collaborator.
Software Modules (Lmod)
The CRCD clusters host thousands of software packages, libraries, and compilers.
Installing all of these into the standard system paths (like /usr/bin) would be chaotic: different users need different versions of the same software (e.g., Python 3.8 vs. Python 3.11), and many applications have conflicting dependencies.
To manage this complexity, we use Lmod, a Lua-based environment module system.
Lmod allows you to dynamically modify your user environment to access specific software versions on demand.
Think of the system as a library: the books (software) are stored in the stacks, and you must check them out (module load) to use them.
How Modules Work
When you log in, your environment is relatively “clean,” containing only basic system tools.
When you load a module, Lmod silently prepends paths to your environment variables (specifically $PATH and $LD_LIBRARY_PATH).
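You can see this happen by inspecting your environment before and after a load (a sketch; the exact paths and versions on CRCD will differ):
```bash
which python3                            # may point to the bare system Python, or nothing
module load python/3.12.8                # prepend the module's directories to $PATH
which python3                            # now resolves inside the module's install tree
echo "$PATH" | tr ':' '\n' | head -n 3   # the module's bin directory appears first
```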
Essential Commands
The module command is your primary interface for managing software.
| Command | Description |
|---|---|
module spider <name> | Search. Finds all versions of a software package, even if they aren’t currently loadable. |
module load <name>/<ver> | Load. Adds the specific version of the software to your environment. |
module list | Check. Shows all currently loaded modules. |
module purge | Clean. Unloads all modules, returning your environment to a default state. |
module avail | Browse. Lists only the modules that are compatible with your currently loaded compilers/libraries. |
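For example, a typical sequence when setting up a compiler and an MPI library might look like this (the version numbers match those used later on this page; the module list output is abbreviated and illustrative):
```bash
module purge              # start from a clean environment
module load gcc/12.2.0    # load the compiler first
module load openmpi/4.1.5 # then a library built against that compiler
module list               # verify what is loaded
#   Currently Loaded Modules:
#     1) gcc/12.2.0   2) openmpi/4.1.5
```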
Finding Software (module spider)
New users often try module avail to find software, but this command only shows what is immediately available.
Many packages on CRCD are built with specific compilers (like GCC or Intel) and are hidden until you load that compiler.
To find software, always use module spider.
It searches the entire database.
[amm503@login0 ~]$ module spider python
-----------------------------------------------------------------------------------------------
python:
-----------------------------------------------------------------------------------------------
Versions:
python/ondemand-jupyter-python3.8
python/ondemand-jupyter-python3.9
python/ondemand-jupyter-python3.11
python/py37_venv_23.1.0
python/pytorch_251_311_cu124
python/tensorflow_218_311
python/3.7.0
python/3.7.17
python/3.8.18
python/3.8.20-orhs6eu
python/3.9.18
python/3.10.13
python/3.11.6
python/3.11.9
python/3.11.11-fayknjn
python/3.12.0
python/3.12.8
Other possible modules matches:
openslide-python py-biopython py-bx-python py-gitpython py-ipython py-ipython-genutils ...
-----------------------------------------------------------------------------------------------
To find other possible module matches execute:
$ module -r spider '.*python.*'
-----------------------------------------------------------------------------------------------
For detailed information about a specific "python" package (including how to load the modules) use the module's full name.
Note that names that have a trailing (E) are extensions provided by other modules.
For example:
$ module spider python/3.12.8
-----------------------------------------------------------------------------------------------
If you run the specific command suggested in the output, Lmod will tell you exactly which dependencies you need to load first.
Loading Software (module load)
Once you know the module name, load it. It is best practice to always specify the version number explicitly. If you omit the version, Lmod will load the default (usually the latest), which may change unexpectedly and break your scripts in the future.
# Good Practice: Explicit versioning
module load gcc/12.2.0 python/3.11.4
# Risky Practice: Implicit defaults
module load gcc python
Warning
Avoid module load in .bashrc
It is tempting to add module load commands to your ~/.bashrc file so your favorite tools are always ready.
Do not do this.
It can cause silent failures in batch jobs, interfere with the X2Go desktop, and break the Open OnDemand portal.
Always load modules inside your job scripts or interactive sessions.
Managing Conflicts (module purge)
Because different software chains can conflict (e.g., trying to load an Intel-compiled library while using the GCC compiler), it is good hygiene to clean your environment before starting a new task.
This is especially important in batch scripts (.slurm files) to ensure the job runs in a predictable environment.
# In a job script:
module purge # Clear everything
module load gcc/12.2.0 # Load only what is needed
module load openmpi/4.1.5
Hardware
The CRCD operates several HPC clusters, each optimized for a different use case. Users connect to these systems through a shared pool of login nodes, which serve as the primary entry point for submitting jobs. Selecting the appropriate cluster ensures efficient resource use and reduces wait times in the job queue.
Login
Login nodes serve as the primary entry point for accessing CRCD systems. They provide secure command-line access over SSH, allowing users to submit jobs, manage files, and launch interactive tasks.
Login nodes are a shared resource that provide fast network access and responsive storage for tasks such as preparing data, compiling code, or submitting jobs. They are not intended for running large-scale analysis or compute-heavy tasks. Running intensive computations directly on login nodes can slow down systems for all users, and built-in limits on CPU, memory, and runtime may prevent larger jobs from completing successfully. For analysis and compute-intensive tasks, users should submit jobs to the compute nodes, which are designed to handle high-performance workloads.
| hostname | backend hostname | Architecture | Cores/Node | Mem (Mem/Core) | OS Drive |
|---|---|---|---|---|---|
| h2p.crc.pitt.edu | login0.crc.pitt.edu | Intel Xeon Gold 6326 (Ice Lake) | 32 | 256 GB (8 GB) | 2x 480 GB NVMe (RAID 1) |
| | login1.crc.pitt.edu | Intel Xeon Gold 6326 (Ice Lake) | 32 | 256 GB (8 GB) | 2x 480 GB NVMe (RAID 1) |
| htc.crc.pitt.edu | login3.crc.pitt.edu | Intel Xeon Gold 6326 (Ice Lake) | 32 | 256 GB (8 GB) | 2x 480 GB NVMe (RAID 1) |
SMP
The SMP cluster is designed for workloads that run on a single node using shared memory parallelism. Each node provides multiple CPU cores with access to a common memory space, making the cluster well-suited for multithreaded applications, OpenMP codes, and jobs that do not require distributed computing across multiple nodes.
| Partition | Host Architecture | --constraint | Nodes | Cores/Node | Mem/Node (Mem/Core) | Scratch | Node Names |
|---|---|---|---|---|---|---|---|
| smp | AMD EPYC 9374F (Genoa) | amd, genoa | 43 | 64 | 768 GB (12 GB) | 3.2 TB NVMe | smp-n[214-256] |
| | AMD EPYC 7302 (Rome) | amd, rome | 55 | 32 | 256 GB (8 GB) | 1 TB SSD | smp-n[156-210] |
| high-mem | Intel Xeon Platinum 8352Y (Ice Lake) | intel, ice_lake | 8 | 64 | 1 TB (16 GB) | 10 TB NVMe | smp-1024-n[1-8] |
| | Intel Xeon Platinum 8352Y (Ice Lake) | intel, ice_lake | 2 | 64 | 2 TB (32 GB) | 10 TB NVMe | smp-2048-n[0-1] |
| | AMD EPYC 7351 (Naples) | amd, naples | 1 | 32 | 1 TB (32 GB) | 1 TB NVMe | smp-1024-n0 |
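The values in the --constraint column can be passed to SLURM to pin a job to a specific architecture. Here is a minimal sketch of the directive block for a single-node SMP job; the account name is a placeholder, and you should check crc-sinfo and the CRCD documentation for the exact cluster and partition names available to your allocation:
```bash
#!/bin/bash
#SBATCH --cluster=smp
#SBATCH --partition=smp
#SBATCH --constraint=genoa        # only schedule on the AMD EPYC 9374F (Genoa) nodes
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --mem-per-cpu=4GB
#SBATCH --time=0-02:00:00
#SBATCH --account=<your_allocation>

module purge
module load gcc/12.2.0

./my_threaded_program             # placeholder for your multithreaded executable
```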
MPI
The MPI cluster is optimized to support parallel workloads running across many nodes at once. High-speed networking enables low-latency communication between processes, making the cluster ideal for tightly coupled codes that use message-passing interfaces or other distributed frameworks.
Jobs on the MPI cluster are allocated a minimum of 2 nodes. Users who regularly run single-node workloads should submit to the SMP cluster instead, which is specifically designed for single-node jobs. While running single-node jobs on MPI is allowed, resources allocated on the unused node(s) will still be counted against the user’s consumed service units.
| Partition | Host Architecture | Nodes | Cores/Node | Mem/Node (Mem/Core) | Scratch | Network | Node Names |
|---|---|---|---|---|---|---|---|
| mpi | Intel Xeon Gold 6342 (Ice Lake) | 136 | 48 | 512 GB (10.6 GB) | 1.6 TB NVMe | HDR200; 10GbE | mpi-n[0-135] |
| ndr | AMD EPYC 9575F | 18 | 128 | 1.5 TB (11.2 GB) | 2.9 TB NVMe | NDR200; 10GbE | mpi-n[136-153] |
| opa-high-mem | Intel Xeon Gold 6132 (Skylake) | 36 | 28 | 192 GB (6.8 GB) | 500 TB SSD | OPA; 10GbE | opa-n[96-131] |
HTC
The HTC cluster is designed for data-intensive health science workflows such as genomics and neuroimaging. Jobs run on single nodes and are well-suited for high-throughput pipelines that process many independent tasks in parallel.
Resource allocation on HTC is prioritized for projects funded by the National Institutes of Health (NIH). Non-NIH projects may also use the cluster, but users who are not running biomedical workloads or do not require hardware specific to the HTC cluster are encouraged to use the MPI or SMP clusters instead.
| Partition | Host Architecture | --constraint | Nodes | Cores/Node | Mem/Node (Mem/Core) | Scratch | Node Names |
|---|---|---|---|---|---|---|---|
| htc | AMD EPYC 9374F (Genoa) | amd, genoa | 20 | 64 | 768 GB (12 GB) | 3.2 TB NVMe | htc-n[50-69] |
| | Intel Xeon Platinum 8352Y (Ice Lake) | intel, ice_lake | 18 | 64 | 512 GB (8 GB) | 2 TB NVMe | htc-n[32-49] |
| | Intel Xeon Platinum 8352Y (Ice Lake) | intel, ice_lake | 4 | 64 | 1 TB (16 GB) | 2 TB NVMe | htc-1024-n[0-3] |
| | Intel Xeon Gold 6248R (Cascade Lake) | intel, cascade_lake | 8 | 48 | 768 GB (16 GB) | 960 GB SSD | htc-n[24-31] |
GPU
The GPU cluster is optimized for workloads requiring GPU acceleration, including machine learning, molecular dynamics simulations, and large-scale data analysis. The cluster supports CUDA, TensorFlow, PyTorch, and other GPU-accelerated frameworks. Users who do not require GPU resources are strongly encouraged to leverage the MPI or SMP clusters instead.
| Partition Name | Node Count | GPU Type | GPU/Node | --constraint | Host Architecture | Core/Node | Max Core/GPU | Mem/Node (Mem/Core) | Scratch | Network | Node Names |
|---|---|---|---|---|---|---|---|---|---|---|---|
| l40s | 20 | L40S 48GB | 4 | l40s,48g,intel | Intel Xeon Platinum 8462Y+ | 64 | 16 | 512 GB (8 GB) | 7 TB NVMe | 10GbE | gpu-n[55-74] |
| a100 | 10 | A100 40GB PCIe | 4 | a100,40g,amd | AMD EPYC 7742 (Rome) | 64 | 16 | 512 GB (8 GB) | 2 TB NVMe | HDR200; 10GbE | gpu-n[35-44] |
| | 2 | A100 40GB PCIe | 4 | a100,40g,intel | Intel Xeon Gold 5220R (Cascade Lake) | 48 | 12 | 384 GB (8 GB) | 1 TB NVMe | 10GbE | gpu-n[33-34] |
| a100_multi | 10 | A100 40GB PCIe | 4 | a100,40g,amd | AMD EPYC 7742 (Rome) | 64 | 16 | 512 GB (8 GB) | 2 TB NVMe | HDR200; 10GbE | gpu-n[45-54] |
| a100_nvlink | 2 | A100 80GB SXM | 8 | a100,80g,amd | AMD EPYC 7742 (Rome) | 128 | 16 | 1 TB (8 GB) | 2 TB NVMe | HDR200; 10GbE | gpu-n[31-32] |
| | 3 | A100 40GB SXM | 8 | a100,40g,amd | AMD EPYC 7742 (Rome) | 128 | 16 | 1 TB (8 GB) | 12 TB NVMe | HDR200; 10GbE | gpu-n[28-30] |
TEACH
The Teach cluster provides dedicated resources for classroom instruction and coursework. It provides a stable environment for students to learn HPC concepts, run assignments, and develop workflows without competing with production research workloads for resources. The Teach cluster is not suitable for research work and should only be used to run jobs associated with classroom activities.
| Resource Type | Node Count | CPU Architecture | Core/Node | CPU Memory (GB) | GPU Card | No. GPU | GPU Memory (GB) |
|---|---|---|---|---|---|---|---|
| CPU | 54 | Gold 6126 Skylake 12C 2.6GHz | 24 | 192 | N/A | N/A | N/A |
| GPU 1 | 7 | E5-2620v3 Haswell 6C 2.4GHz | 12 | 128 | NVIDIA Titan X | 4 | 12 |
| GPU 2 | 6 | E5-2620v3 Haswell 6C 2.5GHz | 12 | 128 | NVIDIA GTX 1080 | 4 | 8 |
| GPU 3 | 10 | Xeon 4112 Skylake 4C 2.6GHz | 8 | 96 | NVIDIA GTX 1080 Ti | 4 | 11 |
| GPU 4 | 2 | Xeon Platinum 8502+ 1.9GHz | 128 | 512 | NVIDIA L4 | 8 | 24 |
SSH Connection using a terminal
SSH (Secure Shell) is a network protocol that allows for secure access to a computer over an unsecured network. This is the protocol for connecting to the CRCD login nodes.
As with any other method for connecting to CRCD resources, you should start by ensuring you have a proper connection to the GlobalProtect VPN. With this connection established, you can proceed with the steps below.
Clients running Windows can use WSL, MobaXterm, or PuTTY to access a terminal emulator. Clients running macOS or Linux can use the built-in Terminal app (in Applications/Utilities).
To render graphics from the remote session, you will also need an X server on your client.
Here are the connection details:
- Connection protocol: `ssh`
- Remote hostname: `h2p.crc.pitt.edu` or `htc.crc.pitt.edu`
- Authentication credentials: Pitt username (all lowercase) and password
The syntax to connect to a CRCD login node from your terminal command line is `ssh username@h2p.crc.pitt.edu`, where `username` is your Pitt username in lowercase; when prompted, enter the corresponding password.
Tip
If you enter the command and nothing happens for more than five seconds, that usually indicates you do not have an active GlobalProtect VPN connection.
You can use CTRL + C to send the SIGINT (i.e., the interrupt signal) to quit the command.
Once successful, you will see this splash screen.
###############################################################################
Welcome to h2p.crc.pitt.edu!
Documentation can be found at https://crc-pages.pitt.edu/user-manual/
-------------------------------------------------------------------------------
IMPORTANT REMINDERS
Don't run jobs on login nodes! Use interactive jobs: `crc-interactive --help`
Slurm is separated into 'clusters', e.g. if `scancel <jobnum>` doesn't work
try `crc-scancel <jobnum>`. Try `crc-sinfo` to see all clusters.
-------------------------------------------------------------------------------
###############################################################################
SSH Keys
The SSH connection steps above work, but they require you to type the full hostname and your password every time. We can set up SSH keys and a client configuration so that we can log into the cluster with just one short alias.
First, we generate an SSH key with ssh-keygen, using the Ed25519 algorithm.
ssh-keygen -t ed25519
It will ask you a few questions, such as where you want to save the key file and whether you want to protect it with a passphrase.
Generating public/private Ed25519 key pair.
Enter file in which to save the key (/Users/alex/.ssh/id_ed25519): /Users/alex/.ssh/id_crcd
Enter passphrase for "/Users/alex/.ssh/id_crcd" (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /Users/alex/.ssh/id_crcd
Your public key has been saved in /Users/alex/.ssh/id_crcd.pub
The key fingerprint is:
SHA256:lHIXSnWohFhdcRPsLnAq+C0ZzCftrBJSDejfyMU6rig alex@seasoned.local
The key's randomart image is:
+--[ED25519 256]--+
| . o.o.o=+=. |
| . o ..oo.+.. |
| . +..=... |
| . . ++o.. . |
| + O .S+ . |
| . O B + . . |
| o + X . |
|E. o + + |
|o .. ..o |
+----[SHA256]-----+
Note that the save-location prompt does not expand ~ in file paths, so give the full path. We also recommend naming the key for its specific use case.
Your private key will look something like this.
-----BEGIN OPENSSH PRIVATE KEY-----
b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAAEbm9uZQAAAAAAAAABAAAAMwAAAAtzc2gtZW
QyNTUxOQAAACBtbFTCwuSex8qyTeNKF9dILx+bDkAaH72Z5rbQUQh2KgAAAJg+kmYSPpJm
EgAAAAtzc2gtZWQyNTUxOQAAACBtbFTCwuSex8qyTeNKF9dILx+bDkAaH72Z5rbQUQh2Kg
AAAEAC3uXC4MWfx7ipEa11KiCmxjTuF/90j7g9lOZO0aC8s21sVMLC5J7HyrJN40oX10gv
H5sOQBofvZnmttBRCHYqAAAAE2FsZXhAc2Vhc29uZWQubG9jYWwBAg==
-----END OPENSSH PRIVATE KEY-----
Your public key will look something like this.
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIG1sVMLC5J7HyrJN40oX10gvH5sOQBofvZnmttBRCHYq alex@seasoned.local
Important
In the above example, I am generating an SSH key to be used only for Pitt’s CRCD services.
This way, if the key is compromised in any way, my only service at risk is CRCD.
I immediately deleted the above SSH keys and removed the public key from user@host:~/.ssh/authorized_keys as they are now (intentionally) compromised.
Once we have generated our public and private key, we have to copy the public key to the server we want to use this SSH key for.
We do this with the ssh-copy-id command on our local computer.
ssh-copy-id -i ~/.ssh/mykey.pub user@host
In my particular case, it would be:
ssh-copy-id -i ~/.ssh/id_crcd.pub amm503@h2p.crc.pitt.edu
Here is what happens when I run the command.
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/Users/alex/.ssh/id_crcd.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
amm503@h2p.crc.pitt.edu's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh -i /Users/alex/.ssh/id_crcd 'amm503@h2p.crc.pitt.edu'"
and check to make sure that only the key(s) you wanted were added.
This stores your public key in the file ~/.ssh/authorized_keys on the server.
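With the public key installed on the server, you can finish the "one short alias" setup by adding an entry to ~/.ssh/config on your local machine. Here is a minimal sketch, assuming the key pair was saved as ~/.ssh/id_crcd; the alias name crcd is just a suggestion:
```
# ~/.ssh/config on your LOCAL machine
Host crcd
    HostName h2p.crc.pitt.edu
    User amm503              # replace with your Pitt username
    IdentityFile ~/.ssh/id_crcd
    IdentitiesOnly yes
```
After saving this file, `ssh crcd` connects using the key; if you set a passphrase on the key, you will be prompted for that instead of your Pitt password.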
SLURM
The Simple Linux Utility for Resource Management (SLURM) is the workload manager that governs access to compute resources on most of the world's supercomputers, including the CRCD clusters. While login nodes are open to everyone for lightweight tasks, compute nodes are exclusive resources that must be requested. SLURM acts as the "traffic controller" for the cluster: it accepts job requests from users, prioritizes them in a queue, and allocates specific compute nodes to run the work when resources become available.
To run an analysis on the CRCD clusters, you must describe your resource requirements (e.g., “I need 1 node, 4 cores, and 10GB of RAM for 2 hours”) and submit this request to SLURM. Jobs can be submitted in two primary modes: Interactive and Batch.
Interactive Jobs
Interactive jobs allow you to work directly on a compute node in real-time. This is useful for debugging code, testing workflows, or running software that requires user input (e.g., RStudio, MATLAB GUI) without overloading the login nodes.
On CRCD systems, the specific wrapper crc-interactive is used to launch these sessions.
This command requests resources and, once granted, forwards your terminal session from the login node to a compute node.
crc-interactive --time 2:00:00 --num-nodes 1 --num-cores 4 --mem 8 --account biosc1640-2026s --teach
Tip
Always request realistic time limits. Shorter jobs (e.g., 1 hour) are often scheduled faster than longer jobs (e.g., 24 hours) because they can fill small gaps in the schedule.
You can run the exit command to return to the login node.
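One quick way to confirm that the session actually moved off the login node is to compare hostnames before and after (a sketch; the node name you receive will differ):
```bash
hostname    # on the login node: e.g., login0.crc.pitt.edu
crc-interactive --time 1:00:00 --num-nodes 1 --num-cores 1 --mem 4 --account biosc1640-2026s --teach
hostname    # inside the session: a compute node such as teach-cpu-n18
exit        # return to the login node
```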
Batch Jobs
For production runs, long simulations, or pipelines that do not require human intervention, users should use batch jobs.
A batch job is defined by a script (usually a bash script ending in .slurm or .sh) that contains two parts:
- Directives: Special comments starting with `#SBATCH` that tell the scheduler what resources are needed.
- Tasks: The actual shell commands to run the analysis (loading modules, running executables).
Users submit these scripts using the sbatch command.
Once submitted, the user can log off; SLURM will run the job when resources become available and save the output to a file.
Here is an example demo-job.slurm script:
#!/bin/bash
#SBATCH --job-name=demo-job
#SBATCH --cluster=teach
#SBATCH --account=biosc1640-2026s
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=1GB
#SBATCH --time=0-01:00:00
# Load necessary software modules
module purge
module load python/3.12.8
# Your calculations
python3 -c 'print("Look ma! No hands!")'
crc-job-stats
To submit this script to the scheduler:
sbatch demo-job.slurm
Once it starts running, SLURM will create an output file called slurm-_____.out, where the blank is filled with the job's numeric ID.
If you cat the output of this job once it is complete, you will see the following.
Look ma! No hands!
==============================================================================
JOB STATISTICS
==============================================================================
JobId: 27755
SubmitTime: 2026-01-21T06:50:54
EndTime: 2026-01-21T07:50:55
RunTime: 00:00:01
AllocTRES: cpu=2,mem=2G,node=1,billing=2
Partition: cpu
NodeList: teach-cpu-n18
Command: /ihome/jdurrant/amm503/sandbox/demo-job/demo-job.slurm
==============================================================================
For more information use the command:
- `sacct -M teach -j 27755 -S 2026-01-21T06:50:54 -E 2026-01-21T07:50:55`
To control the output of the above command:
- Add `--format=<field1,field2,etc>` with fields of interest
- See the list of all possible fields by running: `sacct --helpformat`
==============================================================================
Core SLURM Commands
Once a job is submitted, users need to monitor its status or cancel it if errors occur.
Because CRCD separates resources into different “clusters” (e.g., SMP, MPI, HTC), standard SLURM commands sometimes default to the wrong cluster.
The CRCD wrappers (prefixed with crc-) ensure commands query the entire federated system.
| Action | Standard Command | CRCD Wrapper | Description |
|---|---|---|---|
| Submit | sbatch <script> | N/A | Submits a batch script to the queue. |
| Monitor | squeue -u <user> | crc-squeue | Lists the status of jobs (Pending, Running) for a specific user. |
| Cancel | scancel <jobid> | crc-scancel <jobid> | Terminates a pending or running job. |
| Info | sinfo | crc-sinfo | Displays the status of partitions and nodes (Idle, Allocated, Down). |
| Job Details | scontrol show job <jobid> | N/A | Shows detailed technical information about a specific running or pending job. |
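For example, a typical submit-monitor-cancel sequence might look like this (the job ID 27755 is reused from the earlier output purely for illustration):
```bash
sbatch demo-job.slurm             # submit; SLURM prints the assigned job ID
crc-squeue                        # is the job PD (pending) or R (running)?
scontrol -M teach show job 27755  # detailed information; -M selects the cluster
crc-scancel 27755                 # cancel the job if something looks wrong
```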
Job States
When checking the queue, you will see an acronym under the ST (State) column.
Understanding these codes helps diagnose why a job isn’t running.
- PD (Pending): The job is waiting for resources.
- (Reason: Priority) - Other users have higher priority.
- (Reason: Resources) - The requested hardware is currently in use.
- R (Running): The job has been allocated nodes and is executing.
- CG (Completing): The job is finishing up processes.
- CD (Completed): The job finished successfully (exit code 0).
- F (Failed): The job terminated with a non-zero exit code.
- TO (Timeout): The job hit its time limit and was killed by SLURM.
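For instance, hypothetical crc-squeue output with one running and one pending job might look like this (the columns follow squeue's default format; all values are made up):
```
JOBID  PARTITION  NAME      USER    ST  TIME   NODES  NODELIST(REASON)
27760  cpu        demo-job  amm503  R   12:34  1      teach-cpu-n18
27761  cpu        big-run   amm503  PD  0:00   2      (Resources)
```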
Open Lab
Work on the Remote Computing Certification.
Exit Criteria
You are free to leave once you are comfortable with SLURM and have made progress on the Remote Computing Certification.