Session 02

Date: Wednesday, Jan 21
Entry ticket: T01 and T02

Learning Objectives

  • Differentiate between Login Nodes and Compute Nodes to prevent system overload.
  • Distinguish between Home directories (/ihome) and Project storage (/ix) based on their distinct backup policies and quotas.
  • Manage software versions and dependencies using Lmod commands rather than installing software locally.
  • Configure SSH keys to enable secure, password-less authentication to the cluster.
  • Execute computational tasks by submitting SLURM requests for both Interactive sessions and Batch scripts.

Concept Check

Imagine you are studying a protein simulation. You write a single Python script that does the following in one go:

  1. Reads the massive raw text log.
  2. Filters out any time steps where the protein’s ‘Total Energy’ is positive.
  3. Calculates the average stability of the remaining steps.
  4. Plots the result.

You show the plot to your PI. They ask: ‘Wait, are we sure those positive energy states were errors? What if they were actually rare transition states? Show me the graph with those points included.’ Why is your current ‘Single Script’ architecture a disaster for answering this question?

Code Along

Remote Computing

Scientific computing often requires resources far beyond what a personal laptop can provide. Remote computing is the practice of accessing powerful, centralized computers over a network to perform these tasks. Instead of running a simulation on your physical machine, you send the instructions to a remote machine, or “host,” which executes the work and saves the results.

To navigate this environment effectively, we must define the standard architecture of a High-Performance Computing (HPC) cluster.

The Client-Server Model

The relationship between your computer and the cluster is a Client-Server model.

  • Client: Your personal computer (laptop or desktop). It initiates the connection.
  • Server (Host): The remote computer at the data center. It listens for connections and provides resources.
  • Network: The infrastructure connecting them (Internet/VPN).

Architecture of a Cluster

An HPC cluster is not a single supercomputer; it is a collection of hundreds of individual computers (called nodes) connected by high-speed networks. These nodes are specialized into different roles to maintain efficiency and security.

  flowchart LR
subgraph Local["Local Machine (Client)"]
Laptop[Your Laptop]
end

subgraph Cluster["HPC Cluster (Server Side)"]
    direction TB
    Login[<b>Login Node</b><br/>Gateway & Submission]

    subgraph Compute["Compute Farm"]
        CN1[Compute Node 1]
        CN2[Compute Node 2]
        CN3[Compute Node 3]
        CN4[Compute Node 4]
    end

    Storage[(<b>Shared Storage</b><br/>Home & Scratch)]
end

Laptop -- "SSH (Port 22)" --> Login
Login -- "sbatch / srun" --> CN1
Login -- "sbatch / srun" --> CN2
Login -- "sbatch / srun" --> CN3
Login -- "sbatch / srun" --> CN4

%% Storage connections
Login === Storage
CN1 === Storage
CN2 === Storage
CN3 === Storage
CN4 === Storage

classDef plain fill:#fff,stroke:#333,stroke-width:1px;
classDef important fill:#a2d2ff,stroke:#333,stroke-width:2px;
class Laptop,CN1,CN2,CN3,CN4 plain;
class Login,Storage important;

Login Nodes

When you connect to the cluster, you land on a Login Node. Think of this as the reception desk of a factory.

This is where you edit files, compile code, organize data, and submit jobs to the scheduler. It is the cluster's outward-facing entry point, reachable via SSH (usually only over the VPN). Never run heavy computations here: login nodes are shared by hundreds of users simultaneously, and if one user maxes out the CPU on a login node, the system becomes sluggish for everyone else.

Compute Nodes

The actual heavy lifting happens on the Compute Nodes.

These nodes are optimized for raw processing power. They execute the simulations and analyses you requested. You generally do not log into these directly. Instead, you ask the scheduler (SLURM) to allocate one for you. When you are assigned a compute node, you often have exclusive or semi-exclusive access to its resources, ensuring your calculations run as fast as possible without interference.

Shared Storage

The CRCD operates several distinct file systems, each optimized for specific purposes such as configuration, active computation, or long-term archiving. Because the compute nodes and login nodes share these file systems, you can access your data from anywhere in the cluster.

  flowchart TD
  subgraph Root["/ (Root File System)"]
  direction TB

    subgraph Home["/ihome (Personal Configs)"]
        U1[User Home: /ihome/group/user]
    end

    subgraph Project["/ix & /ix1 (Project Data)"]
        P1[Group Project: /ix/group]
        P2[Group Project: /ix1/group]
    end

end

User((User)) --> U1
User --> P1
User --> P2

style Home fill:#e1f5fe,stroke:#01579b
style Project fill:#e8f5e9,stroke:#2e7d32

File System Summary

| Location | Purpose | Quota | Backups | Snapshots |
| --- | --- | --- | --- | --- |
| /ihome | Home Directories. Small config files, environments, and scripts. | 75 GB (Fixed) | Daily | Yes |
| /ix & /ix1 | Project Storage. The primary location for active datasets and job staging. | 5 TB (Expandable) | None | 7 Days |

Storage Details

Home Directories (/ihome)

This is your entry point when logging into the system. It is organized by user group: /ihome/<primary_group>/<username>.

It is best for source code, Conda/Python environments, and Jupyter session logs. Each user has only 75 GB available; this quota is strict and cannot be increased. This is the only location that is backed up daily.

Project Data (/ix and /ix1)

These are enterprise storage locations designed for persistent file storage and heavy workloads.

This is best for staging input data for compute jobs and storing results. By default, only your PI and group members can access your folder here (/ix/groupname). Groups receive a 5 TB allocation at no cost. Additional storage can be purchased ($85/TB/year). While there are no long-term backups, 7-day snapshots are available to restore accidentally deleted files.

Warning

No Backups on Project Storage

The /ix and /ix1 file systems are not backed up. If you delete a file and wait longer than 7 days, it is gone forever. Always keep copies of critical raw data on separate local storage (e.g., your lab’s OneDrive or a physical hard drive).

Managing Your Storage

Checking Quotas (crc-quota)

To check how close you are to your storage limits across all file systems, use the crc-quota wrapper command:

[amm503@login0 ~]$ crc-quota
User: 'amm503'
-> ihome: 41.43 GB / 75.0 GB

Group: 'jdurrant'
-> ix: 23.5 TB / 29.0 TB

File Permissions & Sharing

Linux uses POSIX permissions to control who can Read, Write, or eXecute files. By default, your files in /ix are visible to your group members but not to outsiders.

  • View permissions: ls -l
  • Change permissions: chmod (e.g., chmod g+rx filename gives your group read/execute access).
  • Default permissions: Determined by umask. A umask of 007 allows full group access, while 077 restricts everything to the owner only. See the example below.
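
As a quick illustration of how these commands fit together (the directory path and group name here are hypothetical), a short session on a project directory might look like this:

# Check current permissions on a results directory
ls -ld /ix/groupname/results

# Grant your group read and execute (traverse) access to it
chmod g+rx /ix/groupname/results

# Make files created later in this shell session group-accessible by default
umask 007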

If you need to share data with a collaborator in a different group, standard permissions are often insufficient. The CRCD recommends using Access Control Lists (ACLs) for this purpose.

Do not try to configure complex ACLs manually if you are unsure. Submit a Help Ticket requesting a shared folder. The admins will create a dedicated directory (e.g., /ix/group/shared/folder) with the correct nfs4_setfacl permissions applied for your collaborator.
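
If you only want to inspect the ACLs already applied to a directory, the read-only companion command nfs4_getfacl is safe to run (assuming the nfs4-acl-tools utilities are installed alongside nfs4_setfacl, as is typical); the path below is hypothetical:

# View the NFSv4 ACL entries on a shared folder (read-only; changes nothing)
nfs4_getfacl /ix/group/shared/folder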

Software Modules (Lmod)

The CRCD clusters host thousands of software packages, libraries, and compilers. Installing all of these into the standard system paths (like /usr/bin) would be chaotic: different users need different versions of the same software (e.g., Python 3.8 vs. Python 3.11), and many applications have conflicting dependencies.

To manage this complexity, we use Lmod, a Lua-based environment module system. Lmod allows you to dynamically modify your user environment to access specific software versions on demand. Think of the system as a library: the books (software) are stored in the stacks, and you must check them out (module load) to use them.

How Modules Work

When you log in, your environment is relatively “clean,” containing only basic system tools. When you load a module, Lmod silently prepends paths to your environment variables (specifically $PATH and $LD_LIBRARY_PATH).
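
You can see this for yourself by checking where python3 resolves from before and after a load; python/3.12.8 is one of the versions used later in this lesson, and the exact paths printed will depend on the cluster's install locations:

# Before loading anything, see which python3 (if any) the base OS provides
which python3

# Load a module, then check again -- its bin directory is now first on $PATH
module load python/3.12.8
which python3
echo $PATH | tr ':' '\n' | head -n 3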

Essential Commands

The module command is your primary interface for managing software.

| Command | Description |
| --- | --- |
| module spider <name> | Search. Finds all versions of a software package, even if they aren’t currently loadable. |
| module load <name>/<ver> | Load. Adds the specific version of the software to your environment. |
| module list | Check. Shows all currently loaded modules. |
| module purge | Clean. Unloads all modules, returning your environment to a default state. |
| module avail | Browse. Lists only the modules that are compatible with your currently loaded compilers/libraries. |

Finding Software (module spider)

New users often try module avail to find software, but this command only shows what is immediately available. Many packages on CRCD are built with specific compilers (like GCC or Intel) and are hidden until you load that compiler.

To find software, always use module spider. It searches the entire database.

[amm503@login0 ~]$ module spider python

-----------------------------------------------------------------------------------------------
  python:
-----------------------------------------------------------------------------------------------
     Versions:
        python/ondemand-jupyter-python3.8
        python/ondemand-jupyter-python3.9
        python/ondemand-jupyter-python3.11
        python/py37_venv_23.1.0
        python/pytorch_251_311_cu124
        python/tensorflow_218_311
        python/3.7.0
        python/3.7.17
        python/3.8.18
        python/3.8.20-orhs6eu
        python/3.9.18
        python/3.10.13
        python/3.11.6
        python/3.11.9
        python/3.11.11-fayknjn
        python/3.12.0
        python/3.12.8
     Other possible modules matches:
        openslide-python  py-biopython  py-bx-python  py-gitpython  py-ipython  py-ipython-genutils  ...

-----------------------------------------------------------------------------------------------
  To find other possible module matches execute:

      $ module -r spider '.*python.*'

-----------------------------------------------------------------------------------------------
  For detailed information about a specific "python" package (including how to load the modules) use the module's full name.
  Note that names that have a trailing (E) are extensions provided by other modules.
  For example:

     $ module spider python/3.12.8
-----------------------------------------------------------------------------------------------

If you run the specific command suggested in the output, Lmod will tell you exactly which dependencies you need to load first.
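
For example, if the detailed output reported that a particular compiler module must be loaded first (the exact prerequisite varies by package; the gcc version below is only a placeholder), the sequence would look like this:

# Ask Lmod for the load instructions for one specific version
module spider python/3.12.8

# Then load the listed prerequisite(s) first, followed by the package itself
module load gcc/12.2.0
module load python/3.12.8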

Loading Software (module load)

Once you know the module name, load it. It is best practice to always specify the version number explicitly. If you omit the version, Lmod will load the default (usually the latest), which may change unexpectedly and break your scripts in the future.

# Good Practice: Explicit versioning
module load gcc/12.2.0 python/3.11.9

# Risky Practice: Implicit defaults
module load gcc python

Warning

Avoid module load in .bashrc

It is tempting to add module load commands to your ~/.bashrc file so your favorite tools are always ready. Do not do this. It can cause silent failures in batch jobs, interfere with the X2Go desktop, and break the Open OnDemand portal. Always load modules inside your job scripts or interactive sessions.

Managing Conflicts (module purge)

Because different software chains can conflict (e.g., trying to load an Intel-compiled library while using the GCC compiler), it is good hygiene to clean your environment before starting a new task. This is especially important in batch scripts (.slurm files) to ensure the job runs in a predictable environment.

# In a job script:
module purge               # Clear everything
module load gcc/12.2.0     # Load only what is needed
module load openmpi/4.1.5

Hardware

The CRCD operates several HPC clusters, each optimized for a different use case. Users connect to these systems through a shared pool of login nodes, which serve as the primary entry point for submitting jobs. Selecting the appropriate cluster ensures efficient resource use and reduces wait times in the job queue.

Login

Login nodes serve as the primary entry point for accessing CRCD systems. They provide secure command-line access over SSH, allowing users to submit jobs, manage files, and launch interactive tasks.

Login nodes are a shared resource that provide fast network access and responsive storage for tasks such as preparing data, compiling code, or submitting jobs. They are not intended for running large-scale analysis or compute-heavy tasks. Running intensive computations directly on login nodes can slow down systems for all users, and built-in limits on CPU, memory, and runtime may prevent larger jobs from completing successfully. For analysis and compute-intensive tasks, users should submit jobs to the compute nodes, which are designed to handle high-performance workloads.

| Hostname [1] | Backend Hostname | Architecture | Cores/Node | Mem (Mem/Core) | OS Drive |
| --- | --- | --- | --- | --- | --- |
| h2p.crc.pitt.edu | login0.crc.pitt.edu | Intel Xeon Gold 6326 (Ice Lake) | 32 | 256 GB (8 GB) | 2x 480 GB NVMe (RAID 1) |
| h2p.crc.pitt.edu | login1.crc.pitt.edu | Intel Xeon Gold 6326 (Ice Lake) | 32 | 256 GB (8 GB) | 2x 480 GB NVMe (RAID 1) |
| htc.crc.pitt.edu | login3.crc.pitt.edu | Intel Xeon Gold 6326 (Ice Lake) | 32 | 256 GB (8 GB) | 2x 480 GB NVMe (RAID 1) |

SMP

The SMP cluster is designed for workloads that run on a single node using shared memory parallelism. Each node provides multiple CPU cores with access to a common memory space, making the cluster well-suited for multithreaded applications, OpenMP codes, and jobs that do not require distributed computing across multiple nodes.

| Partition [2] | Host Architecture | --constraint | Nodes | Cores/Node | Mem/Node (Mem/Core) | Scratch | Node Names |
| --- | --- | --- | --- | --- | --- | --- | --- |
| smp | AMD EPYC 9374F (Genoa) | amd, genoa | 43 | 64 | 768 GB (12 GB) | 3.2 TB NVMe | smp-n[214-256] |
| smp | AMD EPYC 7302 (Rome) | amd, rome | 55 | 32 | 256 GB (8 GB) | 1 TB SSD | smp-n[156-210] |
| high-mem | Intel Xeon Platinum 8352Y (Ice Lake) | intel, ice_lake | 8 | 64 | 1 TB (16 GB) | 10 TB NVMe | smp-1024-n[1-8] |
| high-mem | Intel Xeon Platinum 8352Y (Ice Lake) | intel, ice_lake | 2 | 64 | 2 TB (32 GB) | 10 TB NVMe | smp-2048-n[0-1] |
| high-mem | AMD EPYC 7351 (Naples) | amd, naples | 1 | 32 | 1 TB (32 GB) | 1 TB NVMe | smp-1024-n0 |

MPI

The MPI cluster is optimized to support parallel workloads running across many nodes at once. High-speed networking enables low-latency communication between processes, making the cluster ideal for tightly coupled codes that use message-passing interfaces or other distributed frameworks.

Jobs on the MPI cluster are allocated a minimum of 2 nodes. Users who regularly run single-node workloads should submit to the SMP cluster instead, which is specifically designed for single-node jobs. While running single-node jobs on MPI is allowed, resources allocated on the unused node(s) will still be counted against the user’s consumed service units.

| Partition | Host Architecture | Nodes | Cores/Node | Mem/Node (Mem/Core) | Scratch | Network | Node Names |
| --- | --- | --- | --- | --- | --- | --- | --- |
| mpi | Intel Xeon Gold 6342 (Ice Lake) | 136 | 48 | 512 GB (10.6 GB) | 1.6 TB NVMe | HDR200; 10GbE | mpi-n[0-135] |
| ndr | AMD EPYC 9575F | 18 | 128 | 1.5 TB (11.2 GB) | 2.9 TB NVMe | NDR200; 10GbE | mpi-n[136-153] |
| opa-high-mem | Intel Xeon Gold 6132 (Skylake) | 36 | 28 | 192 GB (6.8 GB) | 500 TB SSD | OPA; 10GbE | opa-n[96-131] |

HTC

The HTC cluster is designed for data-intensive health science workflows such as genomics and neuroimaging. Jobs run on single nodes and are well-suited for high-throughput pipelines that process many independent tasks in parallel.

Resource allocation on HTC is prioritized for projects funded by the National Institutes of Health (NIH). Non-NIH projects may also use the cluster, but users who are not running biomedical workloads or do not require hardware specific to the HTC cluster are encouraged to use the MPI or SMP clusters instead.

| Partition [2] | Host Architecture | --constraint | Nodes | Cores/Node | Mem/Node (Mem/Core) | Scratch | Node Names |
| --- | --- | --- | --- | --- | --- | --- | --- |
| htc | AMD EPYC 9374F (Genoa) | amd, genoa | 20 | 64 | 768 GB (12 GB) | 3.2 TB NVMe | htc-n[50-69] |
| htc | Intel Xeon Platinum 8352Y (Ice Lake) | intel, ice_lake | 18 | 64 | 512 GB (8 GB) | 2 TB NVMe | htc-n[32-49] |
| htc | Intel Xeon Platinum 8352Y (Ice Lake) | intel, ice_lake | 4 | 64 | 1 TB (16 GB) | 2 TB NVMe | htc-1024-n[0-3] |
| htc | Intel Xeon Gold 6248R (Cascade Lake) | intel, cascade_lake | 8 | 48 | 768 GB (16 GB) | 960 GB SSD | htc-n[24-31] |

GPU

The GPU cluster is optimized for workloads requiring GPU acceleration, including machine learning, molecular dynamics simulations, and large-scale data analysis. The cluster supports CUDA, TensorFlow, PyTorch, and other GPU-accelerated frameworks. Users who do not require GPU resources are strongly encouraged to leverage the MPI or SMP clusters instead.

| Partition Name | Node Count | GPU Type | GPU/Node | --constraint | Host Architecture | Core/Node | Max Core/GPU | Mem/Node (Mem/Core) | Scratch | Network | Node Names |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| l40s | 20 | L40S 48GB | 4 | l40s,48g,intel | Intel Xeon Platinum 8462Y+ | 64 | 16 | 512 GB (8 GB) | 7 TB NVMe | 10GbE | gpu-n[55-74] |
| a100 | 10 | A100 40GB PCIe | 4 | a100,40g,amd | AMD EPYC 7742 (Rome) | 64 | 16 | 512 GB (8 GB) | 2 TB NVMe | HDR200; 10GbE | gpu-n[35-44] |
| a100 | 2 | A100 40GB PCIe | 4 | a100,40g,intel | Intel Xeon Gold 5220R (Cascade Lake) | 48 | 12 | 384 GB (8 GB) | 1 TB NVMe | 10GbE | gpu-n[33-34] |
| a100_multi | 10 | A100 40GB PCIe | 4 | a100,40g,amd | AMD EPYC 7742 (Rome) | 64 | 16 | 512 GB (8 GB) | 2 TB NVMe | HDR200; 10GbE | gpu-n[45-54] |
| a100_nvlink | 2 | A100 80GB SXM | 8 | a100,80g,amd | AMD EPYC 7742 (Rome) | 128 | 16 | 1 TB (8 GB) | 2 TB NVMe | HDR200; 10GbE | gpu-n[31-32] |
| a100_nvlink | 3 | A100 40GB SXM | 8 | a100,40g,amd | AMD EPYC 7742 (Rome) | 128 | 16 | 1 TB (8 GB) | 12 TB NVMe | HDR200; 10GbE | gpu-n[28-30] |

TEACH

The Teach cluster provides dedicated resources for classroom instruction and coursework. It offers a stable environment for students to learn HPC concepts, run assignments, and develop workflows without competing with production research workloads for resources. The Teach cluster is not suitable for research work and should only be used to run jobs associated with classroom activities.

| Resource Type | Node Count | CPU Architecture | Core/Node | CPU Memory (GB) | GPU Card | No. GPU | GPU Memory (GB) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CPU | 54 | Gold 6126 Skylake 12C 2.6GHz | 24 | 192 | N/A | N/A | N/A |
| GPU 1 | 7 | E5-2620v3 Haswell 6C 2.4GHz | 12 | 128 | NVIDIA Titan X | 4 | 12 |
| GPU 2 | 6 | E5-2620v3 Haswell 6C 2.5GHz | 12 | 128 | NVIDIA GTX 1080 | 4 | 8 |
| GPU 3 | 10 | Xeon 4112 Skylake 4C 2.6GHz | 8 | 96 | NVIDIA GTX 1080 Ti | 4 | 11 |
| GPU 4 | 2 | Xeon Platinum 8502+ 1.9GHz | 128 | 512 | NVIDIA L4 | 8 | 24 |

SSH Connection Using a Terminal

SSH (Secure Shell) is a network protocol that allows for secure access to a computer over an unsecured network. This is the protocol for connecting to the CRCD login nodes.

As with any other method for connecting to CRCD resources, you should start by ensuring you have a proper connection to the GlobalProtect VPN. With this connection established, you can proceed with the steps below.

Clients running Windows can use WSL, MobaXterm, or PuTTY to access a terminal emulator. Clients running macOS or Linux can use the built-in Terminal app (in Applications/Utilities).

To render graphics from the remote session, you will also need an X server on your client.

Here are the connection details:

  • Connection protocol: ssh
  • Remote hostname: h2p.crc.pitt.edu or htc.crc.pitt.edu
  • Authentication credentials: Pitt username (all lowercase) and password

The syntax to connect to the CRCD login node from your terminal command line is

ssh username@h2p.crc.pitt.edu

where username is your Pitt username in lowercase; when prompted, enter the corresponding password.

Tip

If you enter the command and nothing happens for more than five seconds, that usually indicates you do not have an active GlobalProtect VPN connection. You can use CTRL + C to send the SIGINT (i.e., the interrupt signal) to quit the command.

Once successful, you will see this splash screen.

###############################################################################

                         Welcome to h2p.crc.pitt.edu!

      Documentation can be found at https://crc-pages.pitt.edu/user-manual/

-------------------------------------------------------------------------------

                             IMPORTANT REMINDERS

 Don't run jobs on login nodes! Use interactive jobs: `crc-interactive --help`

 Slurm is separated into 'clusters', e.g. if `scancel <jobnum>` doesn't work
      try `crc-scancel <jobnum>`. Try `crc-sinfo` to see all clusters.

-------------------------------------------------------------------------------

###############################################################################

SSH Keys

The SSH connection steps above work, but they require you to type the full hostname and your password every time. We can set up SSH keys and a client-side configuration to log into the cluster with just one short alias.

First, generate an SSH key pair with ssh-keygen, using the Ed25519 algorithm.

ssh-keygen -t ed25519

It will ask you a few questions, such as where you want to save the key and whether you want to protect it with a passphrase.

Generating public/private Ed25519 key pair.
Enter file in which to save the key (/Users/alex/.ssh/id_ed25519): /Users/alex/.ssh/id_crcd
Enter passphrase for "/Users/alex/.ssh/id_crcd" (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /Users/alex/.ssh/id_crcd
Your public key has been saved in /Users/alex/.ssh/id_crcd.pub
The key fingerprint is:
SHA256:lHIXSnWohFhdcRPsLnAq+C0ZzCftrBJSDejfyMU6rig alex@seasoned.local
The key's randomart image is:
+--[ED25519 256]--+
|   . o.o.o=+=.   |
|  . o ..oo.+..   |
| .   +..=...     |
|  . . ++o.. .    |
|   + O .S+ .     |
|  . O B + . .    |
|   o + X   .     |
|E.  o + +        |
|o .. ..o         |
+----[SHA256]-----+

Note that ~ is not expanded at this prompt, so type the full path. We also recommend naming the key after its specific use case.

Your private key will look something like this.

-----BEGIN OPENSSH PRIVATE KEY-----
b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAAEbm9uZQAAAAAAAAABAAAAMwAAAAtzc2gtZW
QyNTUxOQAAACBtbFTCwuSex8qyTeNKF9dILx+bDkAaH72Z5rbQUQh2KgAAAJg+kmYSPpJm
EgAAAAtzc2gtZWQyNTUxOQAAACBtbFTCwuSex8qyTeNKF9dILx+bDkAaH72Z5rbQUQh2Kg
AAAEAC3uXC4MWfx7ipEa11KiCmxjTuF/90j7g9lOZO0aC8s21sVMLC5J7HyrJN40oX10gv
H5sOQBofvZnmttBRCHYqAAAAE2FsZXhAc2Vhc29uZWQubG9jYWwBAg==
-----END OPENSSH PRIVATE KEY-----

Your public key will look something like this.

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIG1sVMLC5J7HyrJN40oX10gvH5sOQBofvZnmttBRCHYq alex@seasoned.local

Important

In the above example, I am generating an SSH key to be used only for Pitt’s CRCD services. This way, if the key is compromised in any way, my only service at risk is CRCD. I immediately deleted the above SSH keys and removed the public key from user@host:~/.ssh/authorized_keys as they are now (intentionally) compromised.

Once we have generated our public and private key, we have to copy the public key to the server we want to use this SSH key for. We do this with the ssh-copy-id command on our local computer.

ssh-copy-id -i ~/.ssh/mykey.pub user@host

In my particular case, it would be:

ssh-copy-id -i ~/.ssh/id_crcd.pub amm503@h2p.crc.pitt.edu

Here is what happens when I run the command.

/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/Users/alex/.ssh/id_crcd.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
amm503@h2p.crc.pitt.edu's password:

Number of key(s) added:        1

Now try logging into the machine, with: "ssh -i /Users/alex/.ssh/id_crcd 'amm503@h2p.crc.pitt.edu'"
and check to make sure that only the key(s) you wanted were added.

This stores your public key in the file ~/.ssh/authorized_keys on the server.
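
To get the one-alias login promised above, add an entry for the cluster to the ~/.ssh/config file on your local machine. A minimal sketch, using the key path and username from this example (adjust both, and the alias name, to your own):

# ~/.ssh/config (local machine)
Host crcd
    HostName h2p.crc.pitt.edu
    User amm503
    IdentityFile ~/.ssh/id_crcd

With this in place, ssh crcd connects using the key instead of a password (assuming your GlobalProtect VPN connection is active).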

SLURM

The Simple Linux Utility for Resource Management (SLURM) is the workload manager that governs access to compute resources on most of the world’s supercomputers. While login nodes are open to everyone for lightweight tasks, compute nodes are exclusive resources that must be requested. SLURM acts as the “traffic controller” for the cluster: it accepts job requests from users, prioritizes them in a queue, and allocates specific compute nodes to run the work when resources become available.

To run an analysis on the CRCD clusters, you must describe your resource requirements (e.g., “I need 1 node, 4 cores, and 10GB of RAM for 2 hours”) and submit this request to SLURM. Jobs can be submitted in two primary modes: Interactive and Batch.

Interactive Jobs

Interactive jobs allow you to work directly on a compute node in real-time. This is useful for debugging code, testing workflows, or running software that requires user input (e.g., RStudio, MATLAB GUI) without overloading the login nodes.

On CRCD systems, the specific wrapper crc-interactive is used to launch these sessions. This command requests resources and, once granted, forwards your terminal session from the login node to a compute node.

crc-interactive --time 2:00:00 --num-nodes 1 --num-cores 4 --mem 8 --account biosc1640-2026s --teach

Tip

Always request realistic time limits. Shorter jobs (e.g., 1 hour) are often scheduled faster than longer jobs (e.g., 24 hours) because they can fill small gaps in the schedule.

You can run the exit command to return to the login node.
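
A typical interactive session looks roughly like this (the compute node name is illustrative, and my_analysis.py is a hypothetical script):

[amm503@login0 ~]$ crc-interactive --time 1:00:00 --num-nodes 1 --num-cores 1 --mem 2 --account biosc1640-2026s --teach
[amm503@teach-cpu-n18 ~]$ module load python/3.12.8
[amm503@teach-cpu-n18 ~]$ python3 my_analysis.py
[amm503@teach-cpu-n18 ~]$ exit
[amm503@login0 ~]$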

Batch Jobs

For production runs, long simulations, or pipelines that do not require human intervention, users should use batch jobs. A batch job is defined by a script (usually a bash script ending in .slurm or .sh) that contains two parts:

  1. Directives: Special comments starting with #SBATCH that tell the scheduler what resources are needed.
  2. Tasks: The actual shell commands to run the analysis (loading modules, running executables).

Users submit these scripts using the sbatch command. Once submitted, the user can log off; SLURM will run the job when resources exist and save the output to a file.

Here is an example demo-job.slurm script:

#!/bin/bash
#SBATCH --job-name=demo-job
#SBATCH --cluster=teach
#SBATCH --account=biosc1640-2026s
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=1GB
#SBATCH --time=0-01:00:00

# Load necessary software modules
module purge
module load python/3.12.8

# Your calculations
python3 -c 'print("Look ma! No hands!")'

crc-job-stats

To submit this script to the scheduler:

sbatch demo-job.slurm

Once it starts running, SLURM creates an output file named slurm-<jobid>.out, where <jobid> is the numeric job ID assigned at submission. If you cat this file once the job is complete, you will see the following.

Look ma! No hands!
==============================================================================
                                JOB STATISTICS
==============================================================================

           JobId: 27755
      SubmitTime: 2026-01-21T06:50:54
         EndTime: 2026-01-21T07:50:55
         RunTime: 00:00:01
       AllocTRES: cpu=2,mem=2G,node=1,billing=2
       Partition: cpu
        NodeList: teach-cpu-n18
         Command: /ihome/jdurrant/amm503/sandbox/demo-job/demo-job.slurm

==============================================================================
 For more information use the command:
   - `sacct -M teach -j 27755 -S 2026-01-21T06:50:54 -E 2026-01-21T07:50:55`

 To control the output of the above command:
   - Add `--format=<field1,field2,etc>` with fields of interest
   - See the list of all possible fields by running: `sacct --helpformat`
==============================================================================

Core SLURM Commands

Once a job is submitted, users need to monitor its status or cancel it if errors occur. Because CRCD separates resources into different “clusters” (e.g., SMP, MPI, HTC), standard SLURM commands sometimes default to the wrong cluster. The CRCD wrappers (prefixed with crc-) ensure commands query the entire federated system.

| Action | Standard Command | CRCD Wrapper | Description |
| --- | --- | --- | --- |
| Submit | sbatch <script> | N/A | Submits a batch script to the queue. |
| Monitor | squeue -u <user> | crc-squeue | Lists the status of jobs (Pending, Running) for a specific user. |
| Cancel | scancel <jobid> | crc-scancel <jobid> | Terminates a pending or running job. |
| Info | sinfo | crc-sinfo | Displays the status of partitions and nodes (Idle, Allocated, Down). |
| Job Details | scontrol show job <jobid> | N/A | Shows specific technical details about a running job. |
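
For example, a typical submit-monitor-cancel sequence looks like this (the job ID is whatever sbatch reports; 27755 is the ID from the run shown earlier):

# Submit the script, then watch the queue until the job starts
sbatch demo-job.slurm
crc-squeue

# If something looks wrong, cancel it by job ID
crc-scancel 27755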

Job States

When checking the queue, you will see a short code under the ST (State) column. Understanding these codes helps diagnose why a job isn’t running; an example query is shown after the list below.

  • PD (Pending): The job is waiting for resources.
    • (Reason: Priority) - Other users have higher priority.
    • (Reason: Resources) - The requested hardware is currently in use.
  • R (Running): The job has been allocated nodes and is executing.
  • CG (Completing): The job is finishing up processes.
  • CD (Completed): The job finished successfully (exit code 0).
  • F (Failed): The job terminated with a non-zero exit code.
  • TO (Timeout): The job hit its time limit and was killed by SLURM.
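
To see the state code and the pending reason side by side for your own jobs, you can ask squeue for those columns directly; a sketch (the -M flag selects the cluster, here teach, and the format string is just one reasonable choice):

# %i = job ID, %j = name, %t = state code (PD, R, ...), %M = elapsed time, %r = pending reason
squeue -M teach -u $USER --format="%.10i %.12j %.4t %.10M %r"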

Open Lab

Work on the Remote Computing Certification.

Exit Criteria

You are free to leave once you are comfortable with SLURM and have made progress on the Remote Computing Certification.


  [1] The login nodes have a 25GbE Network.

  [2] All SMP and HTC partitions have a 10GbE Network.
