Understanding Health Center Cluster Xanadu (SLURM)

Xanadu Cluster Information

  1. What is a Cluster
  2. How to obtain an account
  3. How to reset a password
  4. How to access the Xanadu cluster
  5. Resources Available
  6. Working with Slurm

 

What is a Cluster

A desktop or a laptop is, in most cases, inadequate for the analysis of large-scale datasets (e.g. genomics, transcriptomics) or for running simulations (e.g. protein docking, weather forecasting). It lacks both the processing power and the RAM needed to carry out these analyses. This limitation can be overcome by joining many such machines (computers) in a predefined architecture/configuration so that they act as a single unit with enhanced computational power and shared resources. This is the basic concept of a high performance cluster. A cluster consists of a set of connected computers that work together so that they can be viewed as a single system. Each computer unit in a cluster is referred to as a 'node'.

The components of a cluster are usually connected through fast local area networks (LANs), with each node running its own instance of an operating system. Clusters provide better performance and availability than a single computer, at relatively low cost, with the elasticity to run jobs anytime, from anywhere.

 

How to Obtain an Account

In order to obtain an account on Xanadu, you need to have a UCH account or a CAM account. If you do not have a Health Center CAM account, you need to request one using the following link: http://bioinformatics.uconn.edu/contact-us/ , and then select "Account Request (UCHC cluster)" from the inquiry selection list.

Once you submit the request, an email will be sent to you with instructions on how to set up your CAM account.

 

How to Reset the Password

If you need to change your password, use the following link: https://vdusers.cam.uchc.edu/pm/

 

How to Access Xanadu

The Health Center cluster, "Xanadu", can be accessed in two ways, depending on where you are connecting from.

  • Users accessing the cluster from within the Health Center, or through the Health Center VPN, can ssh using:
ssh <user_name>@xanadu-submit-int.cam.uchc.edu
  • Users accessing Xanadu from outside (e.g. the Storrs campus or the UConn VPN) need to use:
ssh <user_name>@xanadu-submit-ext.cam.uchc.edu
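
For convenience, the external hostname can be shortened with an entry in the ~/.ssh/config file on your own machine. This is a minimal sketch assuming your local machine uses OpenSSH; replace <user_name> with your CAM user name:

Host xanadu
    HostName xanadu-submit-ext.cam.uchc.edu
    User <user_name>

With this entry in place, the cluster can be reached with simply:

ssh xanadu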

 

Resources Available

The Xanadu cluster uses Slurm, a highly scalable cluster management and job scheduling system for large and small Linux clusters. The job scheduler divides the nodes into groups called partitions. Xanadu has four such partitions: general, xeon, amd and himem.

 

To look up the available partitions, you can use 'sinfo -s', which gives the current list:

$ sinfo -s
PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
general*     up   infinite        2/15/0/17  xanadu-[01-11,20-25]
xeon         up   infinite        0/11/0/11  xanadu-[01-11]
amd          up   infinite          2/4/0/6  xanadu-[20-25]
himem        up   infinite          0/4/0/4  xanadu-[30-33]

In the output above, general* is the default partition for users. NODES(A/I/O/T) is a count of nodes with a particular configuration by node state, in the form "Allocated / Idle / Other / Total".

 

The Xanadu cluster is divided into two main partitions:

general partition
himem partition

general partition: consists of 17 nodes, each with 32 to 36 CPUs and 128 or 256 GB of memory.

$ sinfo -N -l -p general
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON  
xanadu-01      1  general*        idle   36   2:18:1 257676    15620      1   (null) none    
xanadu-02      1  general*        idle   36   2:18:1 257676    15620      1   (null) none    
xanadu-03      1  general*        idle   36   2:18:1 257676    15620      1   (null) none    
xanadu-04      1  general*        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-05      1  general*        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-06      1  general*        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-07      1  general*        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-08      1  general*        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-09      1  general*        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-10      1  general*        idle   36   2:18:1 128532    15620      1   (null) none
xanadu-11      1  general*        idle   36   2:18:1 128532    15620      1   (null) none
xanadu-20      1  general*       mixed   32    4:8:1 128745    15620      1   (null) none
xanadu-21      1  general*       mixed   32    4:8:1 128745    15620      1   (null) none
xanadu-22      1  general*   allocated   32    4:8:1 257760    15620      1   (null) none
xanadu-23      1  general*   allocated   32    4:8:1 257760    15620      1   (null) none
xanadu-24      1  general*   allocated   32    4:8:1 249696    15620      1   (null) none
xanadu-25      1  general*   allocated   32    4:8:1 209380    15620      1   (null) none

 

himem partition: has 4 nodes, each with 32 cores and 512 GB of memory.

$ sinfo -N -l -p himem

NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
xanadu-30      1     himem        idle   32    4:8:1 515792    15620      1   (null) none
xanadu-31      1     himem        idle   32    4:8:1 515792    15620      1   (null) none
xanadu-32      1     himem        idle   32    4:8:1 515792    15620      1   (null) none
xanadu-33      1     himem        idle   32    4:8:1 515792    15620      1   (null) none

 

The general partition can also be split according to the processor type used in each node:

xeon partition (nodes with Xeon processors)
amd partition (nodes with AMD processors)

xeon partition: has 11 nodes, each with 36 cores and 128 or 256 GB of memory.

$ sinfo -N -l -p xeon

NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
xanadu-01      1      xeon        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-02      1      xeon        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-03      1      xeon        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-04      1      xeon        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-05      1      xeon        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-06      1      xeon        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-07      1      xeon        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-08      1      xeon        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-09      1      xeon        idle   36   2:18:1 257676    15620      1   (null) none 
xanadu-10      1      xeon        idle   36   2:18:1 128532    15620      1   (null) none
xanadu-11      1      xeon        idle   36   2:18:1 128532    15620      1   (null) none

 

amd partition: has 6 nodes, each with 32 cores and 128 to 256 GB of memory.

$ sinfo -N -l -p amd

NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
xanadu-20      1       amd       mixed   32    4:8:1 128745    15620      1   (null) none
xanadu-21      1       amd       mixed   32    4:8:1 128745    15620      1   (null) none
xanadu-22      1       amd   allocated   32    4:8:1 257760    15620      1   (null) none
xanadu-23      1       amd   allocated   32    4:8:1 257760    15620      1   (null) none
xanadu-24      1       amd   allocated   32    4:8:1 249696    15620      1   (null) none
xanadu-25      1       amd   allocated   32    4:8:1 209380    15620      1   (null) none

 


Working with Slurm

Sample script

#!/bin/bash
#SBATCH --job-name=myscript
#SBATCH -n 1
#SBATCH -N 1
#SBATCH --partition=general
#SBATCH --mail-type=END
#SBATCH --mail-user=first.last@uconn.edu
#SBATCH -o myscript.out
#SBATCH -e myscript.err

echo "Hello World"

 

A general script consists of 3 main parts:

  • The #!/bin/bash line, which tells the system to run the script with bash
  • Parameters for the SLURM scheduler indicated by #SBATCH
  • Command submission line(s)

 

The #SBATCH lines indicate the set of parameters for the SLURM scheduler.

#SBATCH --job-name=myscript Sets the name of your job
#SBATCH -n 1 Requests the number of tasks (cores) for your job
#SBATCH -N 1 Requests that the cores are all on one node. Only change this to >1 if you know your code uses a message passing protocol like MPI. SLURM makes no assumptions about this parameter: if you request more than one core (-n > 1) and you forget this parameter, your job may be scheduled across nodes; and unless your job is MPI (multi-node) aware, it will run slowly, oversubscribed on the first node while wasting resources on the other(s).
#SBATCH --partition=general Specifies the SLURM partition (in this instance the general partition) under which the script will be run
#SBATCH --mail-type=END Mailing options to indicate the state of the job; in this instance a notification will be sent when the job ends
#SBATCH --mail-user=first.last@uconn.edu Email address to which the notification should be sent
#SBATCH -o myscript.out Specifies the file to which standard output will be appended
#SBATCH -e myscript.err Specifies the file to which standard error will be appended
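
Putting these options together, the sketch below requests four cores on one node for a multithreaded program and emails on completion or failure. The -c (--cpus-per-task) count, the --mem value and the job name are illustrative choices rather than site requirements, the final echo line stands in for the real command, and %j in the output file names is replaced by the job ID:

#!/bin/bash
#SBATCH --job-name=threaded_job
#SBATCH -n 1
#SBATCH -c 4
#SBATCH -N 1
#SBATCH --mem=16G
#SBATCH --partition=general
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=first.last@uconn.edu
#SBATCH -o threaded_job_%j.out
#SBATCH -e threaded_job_%j.err

# report where the job landed and how many cores Slurm allocated to the task
echo "Running on $(hostname) with $SLURM_CPUS_PER_TASK cores"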

 

How to Submit a Job

A script can be submitted to the cluster using the sbatch command. The above script can be submitted with:

$ sbatch myscript.sh
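
On submission, sbatch replies with a line of the form "Submitted batch job <jobID>". With the --parsable flag it prints the job ID alone, which is convenient for chaining jobs. In this sketch, step1.sh and step2.sh are placeholder script names, and the second job waits until the first finishes successfully:

$ jid=$(sbatch --parsable step1.sh)
$ sbatch --dependency=afterok:$jid step2.sh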

 

Running a Job Interactively

An interactive job can be started using the srun command.
To start a bash session on a compute node:

$ srun --pty bash
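
The same resource options used in batch scripts can be passed to srun. For example, the following sketch (the core and memory values are illustrative) starts an interactive bash session with 4 cores and 8 GB of memory on the general partition:

$ srun --partition=general -c 4 --mem=8G --pty bash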

 

In the following example, srun executes /bin/hostname using three tasks (-n3) on a single node. The -l flag prefixes each output line with the task ID:

$ srun -n3 -l /bin/hostname

2: xanadu-24.cam.uchc.edu
1: xanadu-24.cam.uchc.edu
0: xanadu-24.cam.uchc.edu

 

Monitoring a Job

To monitor a particular job, the squeue command can be used with the job ID:

$ squeue -j  201185 

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            201185   general myscript    [USER_ID] R       0:29      1 xanadu-20

 

To monitor the jobs submitted by a particular user:

$ squeue -u UserID

            JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            200642   general Trinity   UserID  R    3:51:29      1 xanadu-22
            200637   general Trinity   UserID  R    3:54:26      1 xanadu-21
            200633   general Trinity   UserID  R    3:55:51      1 xanadu-20

 

To monitor jobs in a particular partition:

$ squeue -p general

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            287283   general  bfo_111    User1  R   15:54:58      2 xanadu-[24-25]
            203251   general   blastp    User2  R 3-02:22:39      1 xanadu-23
            203252   general   blastp    User3  R 3-02:22:39      1 xanadu-23

 

To display accounting information on a running or completed job, sacct can be used:

$ sacct

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
288052             gmap    general pi-wegrzyn          1  COMPLETED      0:0 
288052.batch      batch            pi-wegrzyn          1  COMPLETED      0:0 
288775             gmap    general pi-wegrzyn          1    RUNNING      0:0

 

$ sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 288775

       JobID    JobName    Account  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode 
------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- 
288775             gmap pi-wegrzyn    general                   1   00:02:55    RUNNING      0:0
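
sacct can also report the resources a job actually used, which helps when sizing future memory requests. MaxRSS (peak resident memory) and MaxVMSize are recorded per job step, so they appear on the .batch and step lines of completed jobs; a sketch for the job above:

$ sacct -j 288775 --format=JobID,JobName,MaxRSS,MaxVMSize,Elapsed,State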

 

To get more detailed information about a specific job, scontrol can be used:

$ scontrol show jobid 900001

JobId=900001 JobName=blast
   UserId=USER1(#####) GroupId=domain users(#####) MCS_label=N/A
   Priority=5361 Nice=0 Account=pi QOS=general
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=01:39:25 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2017-06-27T16:51:36 EligibleTime=2017-06-27T16:51:36
   StartTime=2017-06-27T16:51:36 EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=general AllocNode:Sid=xanadu-submit-int:27120
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=xanadu-24
   BatchHost=xanadu-24
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=blast.sh
   WorkDir=Plant
   StdErr=blast.err
   StdIn=/dev/null
   StdOut=blast.out
   Power=

 

How to Kill a Job

If you need to stop a job which you have submitted, you can use the scancel command with the job ID number:

$ scancel  [jobID number]

 

To terminate all your jobs:

$ scancel -u [UserID]
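
scancel also accepts filters, so a subset of jobs can be cancelled without listing every ID. For example, to cancel only your pending jobs, or only the jobs with a particular name (myscript here is a placeholder):

$ scancel -u [UserID] --state=PENDING
$ scancel -u [UserID] --name=myscript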