The HPCC Cluster uses Slurm as its queuing and load balancing system. To control user traffic, any type of compute-intensive job needs to be submitted via sbatch or srun (see below) to the compute nodes. Much more detailed information on this topic can be found on these sites:
Job submission with sbatch
Print information about queues/partitions available on a cluster.
sinfo
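For more detail, sinfo can also report node-level information. A minimal sketch using standard sinfo options:
sinfo -N -l    # node-oriented long listing: one line per node with state, CPUs, memory and partition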
Compute jobs are submitted with sbatch via a submission script (here script_name.sh).
sbatch script_name.sh
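On submission, sbatch prints the ID assigned to the job. If that ID is needed later in a shell script, sbatch's --parsable option prints only the ID, as sketched below (the variable name is just an example).
jobid=$(sbatch --parsable script_name.sh)   # capture the numeric job ID
echo "Submitted job $jobid"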
The following sample submission script (script_name.sh) executes an R script named my_script.R.
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --time=1-00:15:00 # 1 day and 15 minutes
#SBATCH --mail-user=useremail@address.com
#SBATCH --mail-type=ALL
#SBATCH --job-name="some_test"
#SBATCH -p batch # Choose queue/partition from: intel, batch, highmem, gpu, short
Rscript my_script.R
STDOUT and STDERR of jobs will be written to files named slurm-<jobid>.out or to a custom file specified under #SBATCH --output in the submission script.
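For example, STDOUT and STDERR can be redirected to separate, custom-named files by adding directives like the following to the submission script (the file names are just examples; %j expands to the job ID):
#SBATCH --output=some_test.%j.out    # STDOUT file
#SBATCH --error=some_test.%j.err     # STDERR file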
Interactive sessions with srun
This option logs the user into a compute node of a specified partition (queue), while Slurm monitors and controls the resource request.
srun --pty bash -l
Interactive session with specific resource requests
srun --x11 --partition=short --mem=2gb --cpus-per-task 4 --ntasks 1 --time 1:00:00 --pty bash -l
The argument --mem limits the amount of RAM, --cpus-per-task the number of CPU cores, and --time how long the session will remain active. Under --partition one can choose among different queues and node architectures. Current options under --partition for most users of the HPCC cluster are: intel, batch, highmem, gpu, and short. The latter has a time limit of 2 hours.
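The configured time limit of each partition can be verified with sinfo; the format string below is only one possible example.
sinfo -p intel,batch,highmem,gpu,short -o "%P %l"   # partition name and its time limit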
Monitoring jobs with squeue
List all jobs in the queue
squeue
List jobs of a specific user
squeue -u <user>
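The columns printed by squeue can be customized with a format string; the one shown here is only an example.
squeue -u $USER -o "%.10i %.9P %.20j %.8T %.10M %R"   # job ID, partition, name, state, run time, node/reason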
Print more detailed information about a job
scontrol show job <JOBID>
Custom command to summarize and visualize cluster activity
jobMonitor
Deleting and altering jobs
Delete a single job
scancel -i <JOBID>
Delete all jobs of a user
scancel -u <username>
Delete all jobs of a certain name
scancel --name <myJobName>
Jobs can be altered with scontrol update. The example below changes the walltime (<NEW_TIME>) of a specific job (<JOBID>).
scontrol update jobid=<JOBID> TimeLimit=<NEW_TIME>
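For instance, with a hypothetical job ID, extending the walltime to two days would look like this:
scontrol update jobid=1234567 TimeLimit=2-00:00:00   # hypothetical job ID and new time limit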
Resource limits
Resource limits for users can be viewed as follows.
sacctmgr show account $GROUP format=Account,User,Partition,GrpCPUs,GrpMem,GrpNodes --ass | grep $USER
Similarly, one can view the limits of the group a user belongs to.
sacctmgr show account $GROUP format=Account,User,Partition,GrpCPUs,GrpMem,GrpNodes,GrpTRES%30 --ass | head -3
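An alternative view of the same association limits, filtered directly by user, is sketched below; the selected format fields assume a standard Slurm accounting setup.
sacctmgr show assoc where user=$USER format=Account,User,Partition,GrpTRES%30,MaxJobs,MaxWall   # per-user association limits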