Understanding the Xanadu HPC Resource

  1. What is a cluster
  2. How to obtain an account
  3. How to reset a password
  4. How to access the Xanadu cluster
  5. HPC resources and limits
  6. Working with Slurm
  7. File system
  8. How to Transfer Data Between Clusters
  9. Running a job interactively

What is a cluster

A desktop or a laptop is, in most cases, inadequate for the analysis of large-scale datasets (e.g. genomics) or for running simulations (e.g. protein docking): it lacks both the processing power and the memory to execute these analyses.  This limitation can be overcome by combining machines (computers) in a predefined architecture/configuration so that they act as a single unit with enhanced computational power and shared resources.  This is the basic concept of a high performance cluster.  A cluster consists of a set of connected computers that work together so that they can be viewed as a single system.  Each computer unit in a cluster is referred to as a ‘node’.

The components of a cluster are usually connected through fast local area networks (“LAN”), with each node running its own instance of an operating system.  The benefits of clusters include low cost, elasticity and the ability to run jobs anytime, anywhere.

Cluster Etiquette

  • Never run computational jobs on the head node of the cluster (the node where you first log in).
  • Keep track of what you are running and refrain from using all nodes at the same time; the sketch after this list shows one way to check your running jobs.
  • Run your analyses and write your output files in your home directory or other designated locations.
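
A quick way to keep track of your own jobs is to ask the scheduler what you have running; a minimal sketch (replace <user_name> with your CAM user name; squeue is covered in more detail later in this document):

$ squeue -u <user_name>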

How to obtain an account

To obtain an account on Xanadu, you must have a UCH account, also known as a CAM account. The following link will allow you to request one: http://bioinformatics.uconn.edu/contact-us/

Select “Account Request (Xanadu cluster)” from the list on the contact-us page.

Once you submit the request, you will receive a CAM form to fill out; upon approval, you will be able to access the cluster.

How to reset the password

An interface exists to reset your password here: https://vdusers.cam.uchc.edu/pm/

Your CAM credentials will allow you to access this resource.

How to access Xanadu

Xanadu can be accessed both from within UCH and from external networks (on or off campus).

  • For users accessing the cluster from within UCH or via the UCH VPN (accessible with your CAM credentials), SSH with:
ssh <user_name>@xanadu-submit-int.cam.uchc.edu
  • For users accessing Xanadu from outside (e.g. the Storrs campus or via the UConn Storrs VPN), SSH with:
ssh <user_name>@xanadu-submit-ext.cam.uchc.edu

Connecting to the Cluster using Windows Computer (Putty)

Windows users will need to use an SSH client to connect to the cluster. Install PuTTY and configure it for use:

PuTTY configuration steps:

Opening PuTTY will bring up Window 1 (see below).

  1. Provide the host name, e.g. <user_name>@xanadu-submit-ext.cam.uchc.edu or <user_name>@xanadu-submit-int.cam.uchc.edu
  2. Expand the SSH tab and select X11 (shown in Window 2).
  3. Enable X11 forwarding by selecting it (Window 2).
  4. Scroll up the left panel and select Session (Window 1).
  5. Name your session, e.g. Xanadu_cluster, and click Save.
  6. Your session name should appear under Saved Sessions.
  7. Double-click on your session name to connect to the server with an SSH session.

[Picture 1: PuTTY configuration windows (Window 1 and Window 2)]

Connecting to the Cluster with the Graphical Interface Enabled

In order to display graphics from the cluster, you need software that allows you to use Linux graphical applications remotely.  Xming and XQuartz are the display options available for Windows and Mac respectively.  Download and install the appropriate one on your local computer.

Windows: Xming

Mac: XQuartz

NOTE: Start the X server on your machine (Xming/XQuartz) each time you reboot your PC or whenever you want to use X Windows. Once enabled, Xming will appear in your system tray as a black X with an orange circle around the middle.

To log in to the head node of the cluster, run the following command in a Mac or Linux terminal:

ssh -X <user_name>@xanadu-submit-int.cam.uchc.edu

or

ssh -X <user_name>@xanadu-submit-ext.cam.uchc.edu
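
To verify that X forwarding is working once connected, you can try launching a simple graphical application; for example (this assumes xclock is installed on the cluster, which may not be the case):

$ xclock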

Connecting to the Cluster from outside of UConn

When you are outside of UConn, you need to establish a VPN connection before connecting to the cluster.

Information on how to connect via Virtual Private Network (VPN) client can be found at the following link: http://remoteaccess.uconn.edu/vpn-overview/connect-via-vpn-client-2/

The server URL to connect to the UConn VPN, which requires a NetID login/password, is:

sslvpn.uconn.edu

The server URL to connect to the UCHC VPN, which requires your CAM login/password, is:

http://vpn.uchc.edu/cam

HPC resources and limits

The Xanadu cluster uses Slurm, a highly scalable cluster management and job scheduling system for large and small Linux clusters.  The nodes (individual computers within the cluster) are divided into groups called partitions.  Xanadu has several partitions available: general, xeon, amd, himem1, himem2, himem3, himem4.

To look up the available partition information you can use ‘sinfo -s’ which will give you the current list:

$ sinfo -s
PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
general*     up   infinite        2/15/0/17  xanadu-[01-11,20-25]
xeon         up   infinite        0/11/0/11  xanadu-[01-11]
amd          up   infinite          2/4/0/6  xanadu-[20-25]
himem1       up   infinite          0/1/0/1  xanadu-30
himem2       up   infinite          0/1/0/1  xanadu-31
himem3       up   infinite          0/1/0/1  xanadu-32
himem4       up   infinite          0/1/0/1  xanadu-33

In the above output, general* is the default partition for users (the asterisk marks the default partition). NODES(A/I/O/T) is the count of nodes in each partition by node state, in the form “Allocated / Idle / Other / Total”.

The Xanadu cluster is divided into five main partitions:

general partition
himem1 partition
himem2 partition
himem3 partition
himem4 partition

general partition: consists of 17 nodes, each with 32 to 36 CPUs and 128 or 256 GB of memory.

$ sinfo -N -l -p general
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON  
xanadu-01      1  general*        idle   36   2:18:1 257676    15620      1   (null) none    
xanadu-02      1  general*        idle   36   2:18:1 257676    15620      1   (null) none    
xanadu-03      1  general*        idle   36   2:18:1 257676    15620      1   (null) none    
xanadu-04      1  general*        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-05      1  general*        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-06      1  general*        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-07      1  general*        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-08      1  general*        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-09      1  general*        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-10      1  general*        idle   36   2:18:1 128532    15620      1   (null) none
xanadu-11      1  general*        idle   36   2:18:1 128532    15620      1   (null) none
xanadu-20      1  general*       mixed   32    4:8:1 128745    15620      1   (null) none
xanadu-21      1  general*       mixed   32    4:8:1 128745    15620      1   (null) none
xanadu-22      1  general*   allocated   32    4:8:1 257760    15620      1   (null) none
xanadu-23      1  general*   allocated   32    4:8:1 257760    15620      1   (null) none
xanadu-24      1  general*   allocated   32    4:8:1 249696    15620      1   (null) none
xanadu-25      1  general*   allocated   32    4:8:1 209380    15620      1   (null) none

himem partitions: there are 4 separate high-memory partitions, each with a single node providing 32 cores and 512 GB of memory.

NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
xanadu-30      1     himem1       idle   32    4:8:1 515792    15620      1   (null) none
xanadu-31      1     himem2       idle   32    4:8:1 515792    15620      1   (null) none
xanadu-32      1     himem3       idle   32    4:8:1 515792    15620      1   (null) none
xanadu-33      1     himem4       idle   32    4:8:1 515792    15620      1   (null) none

The general partition can be further divided according to the processor type used in each node:

xeon partition (nodes with Intel Xeon processors)
amd partition (nodes with AMD processors)

xeon partition: has 11 nodes, each with 36 cores and 128 or 256 GB of memory (see the output below).

$ sinfo -N -l -p xeon

NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
xanadu-01      1      xeon        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-02      1      xeon        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-03      1      xeon        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-04      1      xeon        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-05      1      xeon        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-06      1      xeon        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-07      1      xeon        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-08      1      xeon        idle   36   2:18:1 257676    15620      1   (null) none
xanadu-09      1      xeon        idle   36   2:18:1 257676    15620      1   (null) none 
xanadu-10      1      xeon        idle   36   2:18:1 128532    15620      1   (null) none
xanadu-11      1      xeon        idle   36   2:18:1 128532    15620      1   (null) none

amd partition: has 6 nodes, each with 32 cores and 128 to 256 GB of memory.

$ sinfo -N -l -p amd

NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
xanadu-20      1       amd       mixed   32    4:8:1 128745    15620      1   (null) none
xanadu-21      1       amd       mixed   32    4:8:1 128745    15620      1   (null) none
xanadu-22      1       amd   allocated   32    4:8:1 257760    15620      1   (null) none
xanadu-23      1       amd   allocated   32    4:8:1 257760    15620      1   (null) none
xanadu-24      1       amd   allocated   32    4:8:1 249696    15620      1   (null) none
xanadu-25      1       amd   allocated   32    4:8:1 209380    15620      1   (null) none

Summary of the nodes in the Xanadu cluster:

PARTITION   NODES     CPUS/NODE   MEMORY/NODE
general       17      32-36       128-256 GB
xeon          11      36          128-256 GB
amd            6      32          128-256 GB
himem1-4      1 each  32          512 GB

Working with Slurm

Sample script for standard job submission.

#!/bin/bash
#SBATCH --job-name=myscript
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 4
#SBATCH --partition=general
#SBATCH --mail-type=END
#SBATCH --mem=50G
#SBATCH --mail-user=first.last@uconn.edu
#SBATCH -o myscript_%j.out
#SBATCH -e myscript_%j.err

echo "Hello World"

A general script consists of 3 main parts:

  • The #!/bin/bash line, which allows the file to be run as a bash script
  • Parameters for the SLURM scheduler, indicated by #SBATCH
  • The command line(s) for your selected application(s)

The #SBATCH lines indicate the set of parameters for the SLURM scheduler.

#SBATCH --job-name=myscript  Sets the name of your job.
#SBATCH -n 1  (--ntasks) Number of tasks to run. The default is one task per node.
#SBATCH -N 1  (--nodes) Requests that the tasks (-n) and cores (-c) requested are all on the same node. Only change this to >1 if you know your code uses a message passing protocol like MPI. SLURM makes no assumptions about this parameter: if you request more than one task or core and you forget this parameter, your job may be scheduled across nodes; unless your job is MPI (multinode) aware, it will run slowly, as it is oversubscribed on the master node and wastes resources on the other(s).
#SBATCH -c 4  (--cpus-per-task) Number of CPUs requested per task.
#SBATCH --partition=general  Specifies the SLURM partition (in this instance the general partition) under which the script will be run.
#SBATCH --mail-type=END  Mailing options to indicate the state of the job. In this instance a notification is sent when the job ends.
#SBATCH --mem=50G  50 GB of memory requested.
#SBATCH --mail-user=first.last@uconn.edu  Email address to which the notification should be sent.
#SBATCH -o myscript_%j.out  Specifies the file to which the standard output will be appended; %j adds the job ID number to the file name.
#SBATCH -e myscript_%j.err  Specifies the file to which the standard error will be appended; %j adds the job ID number to the file name.

*To submit a job to a particular partition, --partition must be specified in the script.

How to submit a job

Submitting a script to the cluster is done using the sbatch command.  Note:

Number of jobs that can be submitted per user: no limit

Maximum number of jobs (single/within-array) running per user: 100

 All scripts are submitted to the cluster with the following command:

$ sbatch myscript.sh
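
When the job is accepted, sbatch prints the job ID assigned to it; you will need this number to monitor or cancel the job. The ID shown below is only illustrative:

Submitted batch job 201185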

Monitoring a submitted job

To monitor all the jobs, squeue can be used:

$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            246233   general  STO_001    USER1 CG       4:48      1 xanadu-21
            301089     himem ProtMasW    USER2 PD       0:00      1 (Priority)
            301013       amd ProtMasN    USER2  R    5:43:21      1 xanadu-24
            301677   general mv_db.sh    USER3  R      14:48      1 xanadu-22
            297400     himem  bfo_111    USER4  R 1-07:16:26      4 xanadu-[30-33]

This gives information on the jobs in all partitions. One important column is the state (ST) of each job in the queue, where:

R   – Running
PD  – Pending
CG  – Completing
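
If you want to see only jobs in a particular state, the state filter can be combined with the user filter; a minimal sketch (<user_name> is a placeholder):

$ squeue -u <user_name> -t PD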

To monitor a particular job, squeue can be used with the job ID.  In this example, the jobID is 201185.  This number is provided at the time of job submission and can be used to reference the job while it is running and after it has completed.

$ squeue -j 201185 

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            201185   general myscript    [USER_ID] R       0:29      1 xanadu-20

 

To monitor jobs submitted by a user

$ squeue -u UserID

            JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            200642   general Trinity   UserID  R    3:51:29      1 xanadu-22
            200637   general Trinity   UserID  R    3:54:26      1 xanadu-21
            200633   general Trinity   UserID  R    3:55:51      1 xanadu-20

To monitor jobs in a particular partition:

$ squeue -p general

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            287283   general  bfo_111    User1  R   15:54:58      2 xanadu-[24-25]
            203251   general   blastp    User2  R 3-02:22:39      1 xanadu-23
            203252   general   blastp    User3  R 3-02:22:39      1 xanadu-23

 

To display information on a running or completed job, sacct can be used:

$ sacct

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
288052             gmap    general pi-wegrzyn          1  COMPLETED      0:0 
288052.batch      batch            pi-wegrzyn          1  COMPLETED      0:0 
288775             gmap    general pi-wegrzyn          1    RUNNING      0:0

 

$ sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 288775

       JobID    JobName    Account  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode 
------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- 
288775             gmap pi-wegrzyn    general                   1   00:02:55    RUNNING      0:0

 

To get more information about a specific job, scontrol can be used:

$ scontrol show jobid 900001

JobId=900001 JobName=blast
   UserId=USER1(#####) GroupId=domain users(#####) MCS_label=N/A
   Priority=5361 Nice=0 Account=pi QOS=general
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=01:39:25 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2017-06-27T16:51:36 EligibleTime=2017-06-27T16:51:36
   StartTime=2017-06-27T16:51:36 EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=general AllocNode:Sid=xanadu-submit-int:27120
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=xanadu-24
   BatchHost=xanadu-24
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=blast.sh
   WorkDir=Plant
   StdErr=blast.err
   StdIn=/dev/null
   StdOut=blast.out
   Power=

How to cancel a job after submission:

If you need to stop a job which you have submitted, you can use the command scancel with the JobID number:

$ scancel  [jobID number]

To terminate all your jobs:

$ scancel -u [UserID]
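
scancel also accepts filters, so the user and state options can be combined; for example, to cancel only your pending jobs (a sketch; <user_name> is a placeholder):

$ scancel -u <user_name> -t PENDING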

File System Information:

/home/CAM/username : This is your home directory and the default location when you log in to the system.  You can run any analysis from this directory and you have ~10TB of space available here.

/UCHC/LABS/pi-name** : This is a collaborative resource created on request to share data with other members of your lab.  Please contact the CBC to request a folder for your lab.

/UCHC/PROJECTS/project name** : This is a collaborative resource created on request to share data with other project members.  Please contact the CBC to request a folder for your project.

/linuxshare/users/username**# : This directory exists to archive data long term (compressed and tarred).  This space is available upon request.  It is equivalent to /archive on BBC in that it should not be used as a location for running analysis.

/linuxshare/projects/project name**# : This directory exists to archive data long term for specific projects (compressed and tarred).  This space is available upon request.  It is equivalent to /archive on BBC in that it should not be used as a location for running analysis.

NOTE:

# /linuxshare directories are archival; reads and writes will be slower than in the other locations.

** Please place a request for these directories via the contact-us form by selecting the “Bioinformatics and Technical support” option.  Once the directories are created by the Xanadu administrators, users can populate them as they wish.
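
To check how much space your data currently occupies in any of these locations, the standard du command can be used; a minimal sketch (the path is a placeholder for your own directory):

$ du -sh /home/CAM/<user_name>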

Transferring Data Between Clusters:

In order to transfer a data directory between clusters, it is good practice to tar the directory and generate an md5sum string for the tarred file before transferring it. This string should match the md5sum value generated after the file has been transferred; this ensures a complete and uncorrupted transfer of the data. The steps are listed here:

 

(1) Tar the directory

data : Directory with data to be tarred

dataset.tar : Name of the output tarred file.  It has to be specified.

$ tar -cvf dataset.tar data

(2) Create md5sum string

$ md5sum dataset.tar

(3) Transfer file

$ scp [source] [destination]

Case 1: when logged in on BBC and transferring a file to Xanadu

$ scp path/to/dataset.tar user_name@xanadu-submit-ext.cam.uchc.edu:<Path to Home Dir>

Case 2: when logged in on Xanadu and transferring a file from BBC

$ scp user_name@bbcsrv3.biotech.uconn.edu:<path/to/dataset.tar> <Path to Home Dir>

In both cases you will be prompted for a password; this is the password for the cluster you are not currently logged in to.

After the transfer, check the md5sum of the transferred file at the destination:

$ md5sum dataset.tar

The value should match the string generated in step 2.
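
If you save the checksum from step 2 to a file and transfer that file along with the archive, the comparison can be automated with md5sum's check mode; a minimal sketch (the file name dataset.tar.md5 is illustrative):

$ md5sum dataset.tar > dataset.tar.md5    # on the source cluster, after step 2
$ md5sum -c dataset.tar.md5               # at the destination, after the transfer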

(4) Untar the file

$ tar -xvf dataset.tar

Running a job interactively:

The first point of contact for a user on the cluster is the head node (submit node). Any command given at the prompt is executed by the head node. This is not desirable, as the head node has a range of tasks to perform and using it for computation will slow it down for everyone.  However, it is often convenient to run some commands directly on the command line rather than through a script.  This can be achieved by initiating an “interactive session”, which executes your commands on the compute nodes.

How to Start an Interactive Session

Interactive sessions are allowed through both the internal and external submit nodes. There is no big difference between them, except that the external submit node can be reached without using the VPN (explained below). It is also not possible to ssh from the external submit node to other internal servers.

Once logged in, start an interactive session using the srun command.

To start a bash session:

$ srun --pty bash
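
Resources for the interactive session can be requested with the same options used in batch scripts; for example (the partition, CPU count, and memory below are illustrative values):

$ srun --partition=general -c 2 --mem=4G --pty bash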

In the following example, srun executes /bin/hostname using three tasks (-n3) on a single node; the -l option prefixes each line of output with the task ID.

$ srun -n3 -l /bin/hostname

2: xanadu-24.cam.uchc.edu
1: xanadu-24.cam.uchc.edu
0: xanadu-24.cam.uchc.edu

 

How to access the internal submit node through the VPN

Access to the internal submit node is possible after establishing a UCHC VPN connection.  This can be done using Pulse Secure.

  1. Open Pulse Secure
  2. Add a new connection
  3. Set the Server URL to: vpn.uchc.edu/cam
  4. Save
  5. Connect and log in with your CAM ID and password

 

Once a VPN connection is established, log in to the internal submit node:

ssh user_name@xanadu-submit-int.cam.uchc.edu