The HPCC cluster uses Slurm as its queuing and load balancing system. To control user traffic, all compute-intensive jobs need to be submitted via sbatch or srun (see below) to the compute nodes. Much more detailed information on this topic can be found in the official Slurm documentation.
Job submission with sbatch
Print information about queues/partitions available on a cluster.
sinfo
Compute jobs are submitted with sbatch via a submission script (here script_name.sh).
sbatch script_name.sh
The following sample submission script (script_name.sh) executes an R script named my_script.R.
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --time=1-00:15:00 # 1 day and 15 minutes
#SBATCH --mail-user=useremail@address.com
#SBATCH --mail-type=ALL
#SBATCH --job-name="some_test"
#SBATCH -p batch # Choose queue/partition from: intel, batch, highmem, gpu, short
Rscript my_script.R
STDOUT and STDERR of jobs are written to files named slurm-<jobid>.out, or to a custom file specified with #SBATCH --output in the submission script.
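For example, the default log files can be overridden by adding directives like the following to the submission script (file names here are placeholders; %j is Slurm's filename pattern for the job ID):

```shell
#SBATCH --output=my_job-%j.out   # STDOUT; %j expands to the job ID
#SBATCH --error=my_job-%j.err    # STDERR; if omitted, STDERR goes to the --output file
```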
Interactive sessions with srun
This option logs a user in to a compute node of a specified partition (queue), while Slurm monitors and controls the resource request.
srun --pty bash -l
Interactive session with specific resource requests
srun --x11 --partition=short --mem=2gb --cpus-per-task 4 --ntasks 1 --time 1:00:00 --pty bash -l
The argument --mem limits the amount of RAM, --cpus-per-task the number of CPU cores, and --time how long the session will remain active. With --partition one can choose among different queues and node architectures. Current options under --partition for most users of the HPCC cluster are: intel, batch, highmem, gpu, and short. The latter has a time limit of 2 hours.
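Once the interactive shell starts on the compute node, the granted resources can be sanity-checked via Slurm's environment variables (exact variables available can vary with the Slurm version and the options used):

```shell
echo "Partition: $SLURM_JOB_PARTITION"   # partition the session runs in
echo "CPU cores: $SLURM_CPUS_PER_TASK"   # cores granted by --cpus-per-task
```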
Monitoring jobs with squeue
List all jobs in queue
squeue
List jobs of a specific user
squeue -u <user>
Print more detailed information about a job
scontrol show job <JOBID>
Custom command to summarize and visualize cluster activity
jobMonitor
Deleting and altering jobs
Delete a single job
scancel -i <JOBID>
Delete all jobs of a user
scancel -u <username>
Delete all jobs of a certain name
scancel --name <myJobName>
Jobs can be altered with scontrol update. The example below changes the walltime (<NEW_TIME>) of a specific job (<JOBID>).
scontrol update jobid=<JOBID> TimeLimit=<NEW_TIME>
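Note that on most Slurm installations regular users can only lower a job's time limit; increasing it requires an administrator. A hypothetical example (job ID and time value are placeholders):

```shell
# Reduce the walltime of job 123456 to 12 hours (placeholder values)
scontrol update jobid=123456 TimeLimit=12:00:00
```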
Resource limits
Resource limits for users can be viewed as follows.
sacctmgr show account $GROUP format=Account,User,Partition,GrpCPUs,GrpMem,GrpNodes --ass | grep $USER
Similarly, one can view the limits of the group a user belongs to.
sacctmgr show account $GROUP format=Account,User,Partition,GrpCPUs,GrpMem,GrpNodes,GrpTRES%30 --ass | head -3